Abstract

The problem of Hybrid Linear Modeling (HLM) is to model and segment
data using a mixture of affine subspaces. Different strategies have
been proposed to solve this problem; however, rigorous analysis justifying
their performance is missing. This paper suggests the Theoretical Spectral 
Curvature Clustering (TSCC) algorithm for solving the HLM problem, 
and provides careful analysis to justify it. The TSCC algorithm is practi- 
cally a combination of Govindu's multi-way spectral clustering framework 
(CVPR 2005) and Ng et al.'s spectral clustering algorithm (NIPS 2001). 
The main result of this paper states that if the given data is sampled from 
a mixture of distributions concentrated around affine subspaces, then with
high sampling probability the TSCC algorithm segments well the different 
underlying clusters. The goodness of clustering depends on the within- 
cluster errors, the between-clusters interaction, and a tuning parameter 
applied by TSCC. The proof also provides new insights for the analysis of 
Ng et al. (NIPS 2001). 

AMS Subject Classification (2000): 68Q32, 68T05, 62H30 (secondary: 68W40, 
60-99, 15A42) 

Keywords: Hybrid linear modeling, clustering d-flats, multi-way clustering, 
spectral clustering, polar curvature, perturbation analysis, concentration in- 
equalities 



1 Introduction 

The problem of Hybrid Linear Modeling (HLM) is to model data using a collec- 
tion of affine subspaces, or equivalently, flats, and simultaneously segment data 
into subsets representing those flats (see also formulations in [39] and [26]). This 
problem has diverse applications in many areas, such as motion segmentation 
in computer vision, hybrid linear representation of images, classification of face 
images, and temporal segmentation of video sequences (see [26] and references
therein). Also, it is closely related to sparse representation and manifold
learning [35, 28].

Many algorithms and strategies can be applied to this problem. For example,
RANSAC [9, 37, 42], K-Subspaces [16, 14]/K-Planes [5, 38], Subspace Separation
[8, 17, 18], Mixtures of Probabilistic PCA [36], Independent Component
Analysis [15], Tensor Voting [30], Multi-way Clustering [2, 11, 32, 1],
Generalized Principal Component Analysis [39, 26], Manifold Clustering [34],
Local Subspace Affinity [41], Grassmann Clustering [12], Algebraic Multigrid
[19], Agglomerative Lossy Compression [25] and Poisson Mixture Model [13].
However, we are not aware of any probabilistic analysis of the performance of
such algorithms given data sampled from a corresponding hybrid model (with
additive noise). The goal of this paper is to rigorously justify a particular
solution to the HLM problem.

For simplicity we restrict the discussion to the case where all the underlying
flats have the same dimension d ≥ 0, although our theory extends to mixed
dimensions by considering only the maximum dimension. We also assume here
that the intrinsic dimension, d, and the number of clusters, K, are known, and
leave their estimation to future work.

Our solution to HLM, the Theoretical Spectral Curvature Clustering (TSCC) 
algorithm, follows the multi-way spectral clustering framework of Govindu [11]. 
This framework (when applied to HLM) starts by computing an affinity measure
quantifying d-dimensional flatness for any d + 2 points of the data. It then
forms pairwise weights by decomposing the corresponding (d + 2)-way affinity
tensor. Finally, it applies spectral clustering (e.g., [33]) with the pairwise
weights. However, these steps are based on heuristic arguments [11], with no
formal justification for them.

The TSCC algorithm combines Govindu's framework [11] with Ng et al.'s
spectral clustering algorithm [31], while introducing "the polar tensor" (see 
Subsection 2.4). We justify the TSCC algorithm following the strategy of [31] 
in two steps. First, we consider a general affinity tensor instead of the polar 
tensor, and control the goodness of clustering of TSCC by the deviation of 
the affinity tensor from an ideal tensor (Section 4). Next, we show that for 
a more restricted class of affinity tensors (also including the polar tensor) and 
data sampled from a hybrid linear model, the TSCC algorithm clusters the 
data well with high sampling probability (Section 5.2). For the polar tensor, 
the goodness of clustering can be expressed in terms of the within-cluster errors 
(which depend directly on the flatness of the underlying measures), the between- 
clusters interaction (which depends on the separation of the measures), and a 
tuning parameter applied by TSCC (Section 5). 

The rest of the paper is organized as follows. In Section 2 we review some 
theoretical background. In Section 3 we present the TSCC algorithm as a 
combination of Govindu's framework [11] and Ng et al.'s algorithm [31] while 
using the specific polar tensor. Both Sections 4 and 5 analyze the performance of 
the TSCC algorithm. The former section presents the main technical estimates 
for a large class of affinity tensors, while quantifying fundamental notions, in 
particular, the goodness of clustering. The latter section assumes a hybrid linear 
probabilistic model and the use of the polar tensor, and relates the estimates of 
Section 4 to the sampling distribution of the model. Section 6 concludes with a 
brief discussion and possible avenues for future work. Mathematical proofs are 
given in the appendix.

2 Background 

2.1 Notation and Basic Definitions 

Throughout this paper we assume an ambient space $\mathbb{R}^D$ and a
collection of d-flats that are embedded in $\mathbb{R}^D$, where
$0 \le d < D$.

We denote scalars with possibly large values by upper-case plain letters (e.g.,
N, C), and scalars with relatively small values by lower-case Greek letters
(e.g., $\alpha, \varepsilon$); vectors by boldface lower-case letters (e.g.,
$\mathbf{u}, \mathbf{v}$); matrices by boldface upper-case letters (e.g.,
$\mathbf{A}$); tensors by calligraphic capital letters (e.g., $\mathcal{A}$);
and sets by upper-case Roman letters (e.g., X).

For any integer n > 0, we denote the n-dimensional vector of ones by
$\mathbf{1}_n$, and the n x n matrix of ones by $\mathbf{1}_{n \times n}$. The
n x n identity matrix is written as $\mathbf{I}_n$.

The (i, j)-element of a matrix $\mathbf{A}$ is denoted by $A_{ij}$, and the
$(i_1, \ldots, i_n)$-element of an n-way tensor $\mathcal{A}$ is denoted by
$\mathcal{A}(i_1, \ldots, i_n)$. We denote the transpose of a matrix
$\mathbf{A}$ by $\mathbf{A}'$ and that of a vector $\mathbf{v}$ by
$\mathbf{v}'$. The Frobenius norm of a matrix/tensor, denoted by
$\|\cdot\|_F$, is the $\ell_2$ norm of the quantity when viewed as a vector.

If k > 0 is an integer and $\mathbf{A}$ is a positive semidefinite square
matrix, we use $E_k(\mathbf{A})$ to denote the subspace spanned by the top k
eigenvectors of $\mathbf{A}$, and $P^k(\mathbf{A})$ to represent the orthogonal
projector onto $E_k(\mathbf{A})$.

If $\mathbf{x} \in \mathbb{R}^D$ and F is a d-flat in $\mathbb{R}^D$, then we
denote the orthogonal distance from $\mathbf{x}$ to F by
$\mathrm{dist}(\mathbf{x}, F)$. For any r > 0, the ball centered at
$\mathbf{x}$ with radius r is written as $B(\mathbf{x}, r)$. If c > 0, then
$c \cdot B(\mathbf{x}, r) := B(\mathbf{x}, c \cdot r)$. If S is a subset of
$\mathbb{R}^D$, we denote its diameter by diam(S) and its complement by $S^c$.
If S is furthermore discrete, we use |S| to denote its number of elements.

Let $\mu$ be a measure on $\mathbb{R}^D$. We denote the support of $\mu$ by
$\mathrm{supp}(\mu)$, its restriction to a given set S by $\mu|_S$, and the
product measure of n copies of $\mu$, where $n \in \mathbb{N}$, by $\mu^n$.
The d-dimensional Lebesgue measure is denoted by $\mathcal{L}_d$. Also, we use
$(\mathbb{R}^D)^n$ to denote the Cartesian product of n copies of
$\mathbb{R}^D$.

We use P(n, r) to denote the number of permutations of size r from a sequence
of n available elements. That is,

$$P(n, r) := n(n-1) \cdots (n-r+1).$$

2.2 The Problem of Hybrid Linear Modeling

We formulate here a version of the problem of HLM. We will introduce further
restrictions on its setting throughout the paper. Before presenting the problem
we need to define the notions of d-dimensional least squares errors and flats.

If $\mu$ is a Borel probability measure, then the least squares error of
approximating $\mu$ by a d-flat is denoted by $e_2(\mu)$ and defined as follows:

$$e_2(\mu) := \sqrt{\inf_{d\text{-flats } F} \int \mathrm{dist}^2(\mathbf{x}, F) \, \mathrm{d}\mu(\mathbf{x})}. \tag{1}$$

Any minimizer of the above quantity is referred to as a least squares d-flat.
We now incorporate the above definitions and present the problem of hybrid
linear modeling below.

Problem 1. Let $\mu_1, \ldots, \mu_K$ be Borel probability measures and assume
that their d-dimensional least squares errors $\{e_2(\mu_k)\}_{k=1}^{K}$ are
sufficiently small and that their least squares d-flats do not coincide.
Suppose a data set $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\} \subset
\mathbb{R}^D$ is generated as follows: For each k = 1, ..., K, $N_k$ points are
sampled independently and identically from $\mu_k$, so that
$N = N_1 + \cdots + N_K$. The goal of hybrid linear modeling is to segment X
into K subsets representing the underlying d-flats and simultaneously estimate
the parameters of the underlying flats.

We remark that the above notion of sufficiently small least squares errors
combined with non-coinciding least squares d-flats is quantified for our
particular solution later in Subsection 5.2 (by restricting the size of the
constant $\alpha$ of equation (33)). We also remark that we restrict the above
setting in Subsection 2.3 by requiring the measures $\mu_1, \ldots, \mu_K$ to
be regular and possibly d-separated (see Remark 2.5) and later in
Subsection 4.2.1 by imposing the comparability of sizes of
$N_1, \ldots, N_K$ (see equation (12)).



2.3 The Polar Curvature 



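For any d + 2 distinct points $\mathbf{z}_1, \ldots, \mathbf{z}_{d+2} \in
\mathbb{R}^D$, we denote by $V_{d+1}(\mathbf{z}_1, \ldots, \mathbf{z}_{d+2})$
the (d + 1)-volume of the (d + 1)-simplex formed by these points. The polar
sine at each vertex $\mathbf{z}_i$, $1 \le i \le d + 2$, is

$$\mathrm{psin}_{\mathbf{z}_i}(\mathbf{z}_1, \ldots, \mathbf{z}_{d+2}) := \frac{(d+1)! \cdot V_{d+1}(\mathbf{z}_1, \ldots, \mathbf{z}_{d+2})}{\prod_{1 \le j \le d+2,\, j \ne i} \|\mathbf{z}_j - \mathbf{z}_i\|}. \tag{2}$$

Definition 2.1. The polar curvature of $\mathbf{z}_1, \ldots,
\mathbf{z}_{d+2}$ is

$$c_p(\mathbf{z}_1, \ldots, \mathbf{z}_{d+2}) := \mathrm{diam}(\{\mathbf{z}_1, \ldots, \mathbf{z}_{d+2}\}) \cdot \sqrt{\sum_{i=1}^{d+2} \mathrm{psin}^2_{\mathbf{z}_i}(\mathbf{z}_1, \ldots, \mathbf{z}_{d+2})}. \tag{3}$$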
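As a numerical illustration, the following Python sketch (NumPy assumed) evaluates the polar curvature of d + 2 points. It reads the definition as diam times the square root of the sum of squared polar sines, with each polar sine equal to (d+1)! times the simplex volume divided by the product of the edge lengths at that vertex. The function names `simplex_volume` and `polar_curvature` are hypothetical and are not taken from [7] or [23]; the volume is computed from the Gram determinant of the edge vectors, which is one standard choice among several.

```python
import math
import numpy as np

def simplex_volume(points):
    """Volume of the m-simplex spanned by the m + 1 rows of `points`."""
    pts = np.asarray(points, dtype=float)
    edges = pts[1:] - pts[0]                  # m edge vectors from the first vertex
    gram = edges @ edges.T                    # det(gram) = (m! * volume)^2
    m = edges.shape[0]
    return math.sqrt(max(np.linalg.det(gram), 0.0)) / math.factorial(m)

def polar_curvature(points):
    """diam * sqrt(sum_i psin_i^2) for d + 2 distinct points (rows of `points`)."""
    pts = np.asarray(points, dtype=float)
    n = pts.shape[0]                          # n = d + 2
    vol = simplex_volume(pts)
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    psin_sq_sum = 0.0
    for i in range(n):
        edge_product = np.prod(np.delete(dists[i], i))   # edges meeting at vertex i
        psin_sq_sum += (math.factorial(n - 1) * vol / edge_product) ** 2
    return dists.max() * math.sqrt(psin_sq_sum)

# d = 1: a right triangle and three collinear points in the plane
right_triangle = polar_curvature([[0, 0], [1, 0], [0, 1]])
collinear = polar_curvature([[0, 0], [1, 1], [2, 2]])
```

For d = 1 the polar sine at a vertex is the ordinary sine of the triangle's angle there, so the right triangle has curvature sqrt(2) * sqrt(1 + 1/2 + 1/2) = 2, while three collinear points give zero.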



Remark 2.2. The notion of curvature designates here a function of d + 2
variables generalizing the distance function. Indeed, when d = 0, the polar
curvature coincides with the Euclidean distance. We use this name (and probably
abuse it) due to the comparability when d = 1 of the polar curvature with the
Menger curvature multiplied by the square of the corresponding diameter
(see [21]).

Let $\mu$ be a Borel probability measure on $\mathbb{R}^D$. We define the polar
curvature of $\mu$ to be

$$c_p(\mu) := \sqrt{\int c_p^2(\mathbf{z}_1, \ldots, \mathbf{z}_{d+2}) \, \mathrm{d}\mu(\mathbf{z}_1) \cdots \mathrm{d}\mu(\mathbf{z}_{d+2})}. \tag{4}$$



The polar curvatures of randomly sampled (d + 1)-simplices can be used to
estimate the least squares errors of approximating certain probability measures
by d-flats. We start with two preliminary definitions and then state the main
result, which is proved in [23] (following the methods of [24, 21, 22]).
Definition 2.3. We say that a Borel probability measure $\mu$ on
$\mathbb{R}^D$ is d-separated (with parameters $0 < \delta, \omega < 1$) if
there exist d + 2 balls $\{B_i\}_{i=1}^{d+2}$ in $\mathbb{R}^D$ with
$\mu$-measures at least $\delta$ such that

$$V_d(\mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_{d+1}}) > \omega \cdot \mathrm{diam}(\mathrm{supp}(\mu))^d,$$

for any $\mathbf{x}_{i_k} \in 2B_{i_k}$, $1 \le k \le d + 1$ and
$1 \le i_1 < \cdots < i_{d+1} \le d + 2$.

Definition 2.4. We say that a Borel probability measure $\mu$ on
$\mathbb{R}^D$ is regular (with parameters $C_\mu$ and $\gamma$) if there exist
constants $\gamma > 2$ and $C_\mu \ge 1$ such that for any
$\mathbf{x} \in \mathrm{supp}(\mu)$ and
$0 < r \le \mathrm{diam}(\mathrm{supp}(\mu))$:

$$\mu(B(\mathbf{x}, r)) \le C_\mu r^\gamma.$$

If D = 2 (or $\mathrm{supp}(\mu)$ is contained in a 2-flat), then one can allow
$1 < \gamma \le 2$ while strengthening the above equation as follows:

$$C_\mu^{-1} r^\gamma \le \mu(B(\mathbf{x}, r)) \le C_\mu r^\gamma.$$

Theorem 2.1. For any regular and d-separated Borel probability measure $\mu$
there exists a constant C (depending only on the d-separation parameters, i.e.,
$\omega$, $\delta$, and the regularity parameters, i.e., $\gamma$, $C_\mu$)
such that

$$C^{-1} \cdot e_2(\mu) \le c_p(\mu) \le C \cdot e_2(\mu). \tag{5}$$

The following two curvatures also satisfy Theorem 2.1 [23]:

$$c_{\mathrm{dis}}(\mathbf{z}_1, \ldots, \mathbf{z}_{d+2}) := \sqrt{\inf_{d\text{-flats } F} \sum_{1 \le i \le d+2} \mathrm{dist}^2(\mathbf{z}_i, F)},$$

$$c_h(\mathbf{z}_1, \ldots, \mathbf{z}_{d+2}) := \min_{1 \le i \le d+2} \mathrm{dist}(\mathbf{z}_i, F_{(i)}),$$

where $F_{(i)}$ is the d-flat spanned by all the d + 2 points except
$\mathbf{z}_i$. In this paper we use $c_p$ as a representative of the class of
curvatures that satisfy Theorem 2.1, since it seems computationally faster than
the above two (using the numerical framework described in [7]). However, all
the theory developed in this paper applies to the rest of the class.

Remark 2.5. Since we use Theorem 2.1 in Subsection 5.3 to justify our proposed
solution for HLM, we need to assume that the measures $\mu_1, \ldots, \mu_K$ of
Problem 1 are regular and d-separated. However, those restrictions could be
relaxed or avoided as follows. If either $c_{\mathrm{dis}}$ or $c_h$ is used
instead of $c_p$, then Theorem 2.1 holds for merely d-separated probability
measures (no need for regularity). Moreover, in Subsection 5.3 we may only use
the right hand side of equation (5), i.e., the bound of $c_p(\mu)$ in terms of
$e_2(\mu)$ (though it is preferable to have a tight estimate as suggested by
the full equation). For such a bound it is enough to assume that $\mu$ is
merely a regular probability measure. If we use instead of $c_p$ any of the
curvatures $c_{\mathrm{dis}}$, $c_h$, then this latter bound holds for any
Borel probability measure. We also comment that the regularity conditions
described in Definition 2.4 could be further relaxed when replacing
$\mathrm{diam}(\{\mathbf{z}_1, \ldots, \mathbf{z}_{d+2}\})$ in equation (3)
with, e.g., a geometric mean of corresponding edge lengths. More details appear
in [23].
2.4 Affinity Tensors and their Matrix Representations

Throughout the rest of this paper, we consider (d + 2)-way tensors of the form

$$\{\mathcal{A}(i_1, \ldots, i_{d+2})\}_{1 \le i_1, \ldots, i_{d+2} \le N}.$$

We assume that their elements are between zero and one, and invariant under
arbitrary permutations of the indices $(i_1, \ldots, i_{d+2})$, i.e., these
tensors are super-symmetric.

Most commonly, we form the following affinities using the polar curvature:

$$\mathcal{A}_p(i_1, \ldots, i_{d+2}) := \begin{cases} e^{-c_p(\mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_{d+2}})/\sigma}, & \text{if } i_1, \ldots, i_{d+2} \text{ are distinct;} \\ 0, & \text{otherwise.} \end{cases} \tag{6}$$

The corresponding tensor is referred to as the polar tensor.

In the special case of underlying linear subspaces (instead of general affine
ones), we may work with the following (d + 1)-tensor:

$$\mathcal{A}_{p,L}(i_1, \ldots, i_{d+1}) := \begin{cases} e^{-c_p(\mathbf{0}, \mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_{d+1}})/\sigma}, & \text{if } i_1, \ldots, i_{d+1} \text{ are distinct;} \\ 0, & \text{otherwise.} \end{cases} \tag{7}$$

In most of the paper we use the (d + 2)-tensor $\mathcal{A}_p$, while in a few
places we refer to the (d + 1)-tensor $\mathcal{A}_{p,L}$.

Given a (d + 2)-way affinity tensor $\mathcal{A} \in \mathbb{R}^{N \times N
\times \cdots \times N}$, we unfold it into an $N \times N^{d+1}$ matrix
$\mathbf{A}$ in a similar way as in [4, 20]. The i-th row of $\mathbf{A}$
contains all the elements in the i-th "slice" of $\mathcal{A}$:
$\{\mathcal{A}(i, i_2, \ldots, i_{d+2}),\ 1 \le i_2, \ldots, i_{d+2} \le N\}$,
according to an arbitrary but fixed ordering of the last d + 1 indices
$(i_2, \ldots, i_{d+2})$, e.g., the lexicographic ordering. This ordering (when
fixed for all rows) is not important to us, since we are only interested in the
uniquely determined matrix
$\mathbf{W} := \mathbf{A} \mathbf{A}'$ (see Algorithm 1 below).
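The unfolding and the claim that W does not depend on the column ordering can be illustrated with a short NumPy sketch. The helper name `unfold` is hypothetical, and the random tensor below is not super-symmetric; it is used only to show that permuting the columns of the unfolded matrix (the same permutation for every row) leaves W = A A' unchanged.

```python
import numpy as np

def unfold(tensor):
    """Unfold an N x ... x N tensor into an N x N^(d+1) matrix.

    Row i holds the i-th slice, with the remaining indices in
    lexicographic (C-order) arrangement.
    """
    n = tensor.shape[0]
    return tensor.reshape(n, -1)

rng = np.random.default_rng(0)
N, d = 5, 1
tensor = rng.random((N,) * (d + 2))     # a generic (d+2)-way array with entries in [0, 1)

A = unfold(tensor)                      # shape (N, N^(d+1))
W = A @ A.T                             # pairwise weights

# Another fixed ordering of the last d+1 indices: permute the columns of A
perm = rng.permutation(A.shape[1])
W_perm = A[:, perm] @ A[:, perm].T
```

Since W_ij is the inner product of the i-th and j-th slices, any reordering applied consistently to all rows cancels out in the sum.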



3 Theoretical Spectral Curvature Clustering 

We combine Govindu's framework of multi-way spectral clustering [11] with Ng
et al.'s spectral clustering algorithm [31], while incorporating the polar
affinities (equation (6)), to formulate below (Algorithm 1) the Theoretical
Spectral Curvature Clustering (TSCC) algorithm for solving Problem 1.

We refer to this algorithm as theoretical because its complexity and storage
requirement can be rather large (even though polynomial). In [7] we make the
algorithm practical by applying various numerical techniques. In particular,
we suggest a sampling strategy to approximate the matrix $\mathbf{W}$ in an
iterative way, an automatic scheme of tuning the parameter $\sigma$, and a
straightforward procedure to initialize K-means for clustering the rows of
$\mathbf{U}$.

The TSCC algorithm can be seen as two steps of embedding data followed
by K-means. First, each data point $\mathbf{x}_i$ is mapped to
$\mathbf{A}(i, :)$, the i-th row of the matrix $\mathbf{A}$, which contains the
interactions between the point $\mathbf{x}_i$ and all d-flats spanned by any
d + 1 points in the data (indeed, each column corresponds to d + 1 data
points). Second, $\mathbf{x}_i$ is further mapped to the i-th row of the matrix
$\mathbf{U}$. The rows of $\mathbf{U}$ are treated as points in
$\mathbb{R}^K$, to which K-means is applied.

The question of whether or not to normalize the rows of the matrix
$\mathbf{U}$ is an interesting one. For ease of the subsequent theoretical
development in this paper we do not normalize the rows of $\mathbf{U}$. Such a
choice was also adopted in [7] where the TSCC algorithm yielded good numerical
results. In Subsection 4.3.1 we discuss more carefully the normalization of the
matrix $\mathbf{U}$ and show the advantage of such practice.

Algorithm 1 Theoretical Spectral Curvature Clustering (TSCC)

Input: $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\} \subset \mathbb{R}^D$: data set,
       d: common dimension of flats,
       K: number of d-flats,
       $\sigma$: the tuning parameter for computing $\mathcal{A}_p$
Output: K disjoint clusters $C_1, \ldots, C_K$.
Steps:
1: Construct the polar tensor $\mathcal{A}_p$ using equation (6) and the given
   $\sigma$.
2: Unfold $\mathcal{A}_p$ to obtain the affinity matrix $\mathbf{A}$, and form
   the weight matrix $\mathbf{W} := \mathbf{A} \cdot \mathbf{A}'$.
3: Compute the degree matrix
   $\mathbf{D} := \mathrm{diag}\{\mathbf{W} \cdot \mathbf{1}_N\}$, and use it
   to normalize $\mathbf{W}$ to get
   $\mathbf{Z} := \mathbf{D}^{-1/2} \cdot \mathbf{W} \cdot \mathbf{D}^{-1/2}$.
4: Find the top K eigenvectors $\mathbf{u}_1, \mathbf{u}_2, \ldots,
   \mathbf{u}_K$ of $\mathbf{Z}$ and define
   $\mathbf{U} := [\mathbf{u}_1 \mathbf{u}_2 \ldots \mathbf{u}_K] \in \mathbb{R}^{N \times K}$.
5: (optional) Normalize the rows of $\mathbf{U}$ to have unit length or using
   other methods (see Subsection 4.3.1).
6: Apply K-means [27] to the rows of $\mathbf{U}$ to find K clusters, and
   partition the original data into K subsets $C_1, \ldots, C_K$ accordingly.

We remark that one can replace the polar tensor (applied in Step 1 of
Algorithm 1) with other affinity tensors, based on the polar curvature or other
ones that satisfy Theorem 2.1, to form different versions of TSCC. For example,
when the underlying subspaces are known to be linear, one may use the
(d + 1)-tensor $\mathcal{A}_{p,L}$ of equation (7), forming the Theoretical
Linear Spectral Curvature Clustering (TLSCC) algorithm. Another example is the
following class of affinity tensors that are based on the powers of the polar
curvature:



where q ≥ 1 (see Remark 5.4 for interpretation). While Algorithm 1 uses q = 1,
its formulation in [7] uses q = 2, as the latter version of TSCC, when applied
in an iterative way, converges faster.

We justify the TSCC algorithm in two steps. In Section 4 we analyze the
TSCC algorithm with a very general tensor (replacing the polar tensor), and
develop conditions under which TSCC is expected to work well. In particular,
the corresponding analysis applies to the polar tensor. In Section 5 we relate
this analysis with the sampling of Problem 1, and correspondingly formulate a
probabilistic statement for TSCC. The use of the polar curvature yields a clear
explanation for the statement.
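The first four steps of Algorithm 1 can be sketched in Python (NumPy assumed) for small data sets. This is only an illustration under stated assumptions, not the practical SCC implementation of [7]: the names `polar_curvature` and `tscc_embedding` are hypothetical, the polar affinity is taken as exp(-c_p/σ) for distinct indices and 0 otherwise, the full tensor is stored explicitly (so N must be small), and the K-means step is omitted.

```python
import itertools
import math
import numpy as np

def polar_curvature(pts):
    """diam * sqrt(sum of squared polar sines) for the d + 2 rows of `pts`."""
    n = pts.shape[0]
    edges = pts[1:] - pts[0]
    # sqrt(det(Gram)) equals (d+1)! times the simplex volume
    scaled_vol = math.sqrt(max(np.linalg.det(edges @ edges.T), 0.0))
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    psin_sq = sum((scaled_vol / np.prod(np.delete(dists[i], i))) ** 2
                  for i in range(n))
    return dists.max() * math.sqrt(psin_sq)

def tscc_embedding(X, d, K, sigma):
    """Steps 1-4: polar tensor, W = A A', Z = D^{-1/2} W D^{-1/2}, top-K eigenvectors."""
    N = X.shape[0]
    tensor = np.zeros((N,) * (d + 2))             # zero whenever indices repeat
    for combo in itertools.combinations(range(N), d + 2):
        affinity = math.exp(-polar_curvature(X[list(combo)]) / sigma)
        for perm in itertools.permutations(combo):   # super-symmetric fill
            tensor[perm] = affinity
    A = tensor.reshape(N, -1)                     # Step 2: unfold
    W = A @ A.T
    deg = W.sum(axis=1)                           # Step 3: degrees (assumed positive)
    Z = W / np.sqrt(np.outer(deg, deg))
    eigvals, eigvecs = np.linalg.eigh(Z)          # Step 4: symmetric eigenproblem
    top = np.argsort(eigvals)[::-1][:K]
    return eigvals[top], eigvecs[:, top]

# Two clean lines in the plane (d = 1), 8 points each, away from their intersection
t = np.linspace(1.0, 2.0, 8)
X = np.vstack([np.column_stack([t, np.zeros(8)]),
               np.column_stack([np.zeros(8), t])])
eigvals, U = tscc_embedding(X, d=1, K=2, sigma=0.01)
```

With clean data and a small σ, the rows of U belonging to the same line nearly coincide and rows from different lines are nearly orthogonal, so any reasonable clustering of the rows (K-means in Step 6) recovers the two lines.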




Foundations of a Multi-way Spectral Clustering Framework 






4 Analysis of TSCC with a General Affinity Tensor

Following a strategy of Ng et al. [31], we analyze the performance of the TSCC
algorithm with a general affinity tensor (replacing the polar tensor in Step 1
of Algorithm 1) in two steps. First, we define a "perfect" tensor representing
the ideal affinities, and show that in such a hypothetical situation, the K
underlying clusters are correctly separated by the TSCC algorithm. Next, we
assume that TSCC is applied with a general affinity tensor, and control the
goodness of clustering of TSCC by the deviation of the given tensor from the
perfect tensor. Finally, we discuss the effect of the two normalizations in the
TSCC algorithm (Steps 3 and 5 of Algorithm 1).



Notational Convenience 

We maintain the common setting of Problem 1 and all the notation used in the 
TSCC algorithm. 

We denote the K underlying clusters by $C_1, \ldots, C_K$. Each $C_k$ has
$N_k$ points, so that $N = \sum_{1 \le k \le K} N_k$. For ease of presentation
we suppose that $N_1 \le N_2 \le \cdots \le N_K$, and that the points in X are
ordered according to their membership. That is, the first $N_1$ points of X are
in $C_1$, the next $N_2$ points in $C_2$, etc.

We define K index sets $I_1, \ldots, I_K$ having the indices of the points in
$C_1, \ldots, C_K$ respectively, that is,

$$I_k := \Big\{ n \in \mathbb{N} \ \Big|\ \sum_{1 \le j \le k-1} N_j < n \le \sum_{1 \le j \le k} N_j \Big\}, \quad \text{for each } 1 \le k \le K. \tag{9}$$

We let $\mathbf{u}^{(i)}$, $1 \le i \le N$, denote the i-th row of
$\mathbf{U}$ and $\mathbf{c}^{(k)}$, $1 \le k \le K$, denote the center of the
k-th cluster, i.e.,

$$\mathbf{c}^{(k)} := \frac{1}{N_k} \sum_{i \in I_k} \mathbf{u}^{(i)}. \tag{10}$$

4.1 Analysis of TSCC with the Perfect Tensor 

We define here the notion of a perfect tensor and show that TSCC obtains a 
perfect segmentation with such a tensor. 

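Definition 4.1. The perfect tensor associated with Problem 1 is defined as
follows. For any $1 \le i_1, \ldots, i_{d+2} \le N$,

$$\tilde{\mathcal{A}}(i_1, \ldots, i_{d+2}) := \begin{cases} 1, & \text{if } \mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_{d+2}} \text{ are distinct and in the same } C_k; \\ 0, & \text{otherwise.} \end{cases} \tag{11}$$

We designate quantities derived from the perfect tensor $\tilde{\mathcal{A}}$
(by following the TSCC algorithm) with the tilde notation, e.g.,
$\tilde{\mathbf{A}}, \tilde{\mathbf{W}}, \tilde{\mathbf{D}},
\tilde{\mathbf{Z}}, \tilde{\mathbf{U}}$.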

Remark 4.2. When d = 0, the perfect tensor $\tilde{\mathcal{A}}$ reduces to a
block diagonal matrix, with the blocks corresponding to the underlying
clusters. Ng et al. [31] also considered an ideal affinity matrix with a block
diagonal structure. However, they maintained the diagonal blocks computed from
the data, while we assume a more extreme case in which the elements of these
blocks are identically one (except at the diagonal entries). With our
assumption it is possible to follow the steps of TSCC and exactly compute each
quantity.

Our result for TSCC with the perfect tensor $\tilde{\mathcal{A}}$ is formulated
as follows (see proof in Appendix A.1).

Proposition 4.1. If $N_k > d + 2$ for all k = 1, ..., K, then

1. $\tilde{\mathbf{Z}}$ has exactly K eigenvalues of one; the rest are
$\frac{d+1}{(N_k - 1)(N_k - d - 1)}$, $1 \le k \le K$, each replicated
$N_k - 1$ times.

2. The rows of $\tilde{\mathbf{U}}$ are K mutually orthogonal vectors in
$\mathbb{R}^K$. Moreover, each vector corresponds to a distinct underlying
cluster.
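The spectrum in the first part of the proposition can be checked numerically on a small example. The sketch below (NumPy assumed; the helper name `perfect_tensor` is hypothetical) builds the perfect tensor for two clusters with d = 1, follows Steps 2 and 3 of Algorithm 1, and compares the eigenvalues of the normalized matrix with K ones followed by (d+1)/((N_k - 1)(N_k - d - 1)) repeated N_k - 1 times. The latter value is what a direct count gives: within a cluster, W has diagonal entries P(N_k - 1, d + 1) and off-diagonal entries P(N_k - 2, d + 1).

```python
import itertools
import numpy as np

def perfect_tensor(sizes, d):
    """(d+2)-way 0/1 tensor: 1 iff all indices are distinct and in one cluster."""
    N = sum(sizes)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    tensor = np.zeros((N,) * (d + 2))
    for k in range(len(sizes)):
        members = np.flatnonzero(labels == k)
        for idx in itertools.permutations(members, d + 2):
            tensor[idx] = 1.0
    return tensor

sizes, d = [4, 5], 1                       # every N_k exceeds d + 2 = 3
tensor = perfect_tensor(sizes, d)
A = tensor.reshape(tensor.shape[0], -1)    # unfold
W = A @ A.T
deg = W.sum(axis=1)
Z = W / np.sqrt(np.outer(deg, deg))        # D^{-1/2} W D^{-1/2}
eigvals = np.sort(np.linalg.eigvalsh(Z))[::-1]

# Predicted spectrum: K ones, then one small eigenvalue per cluster, N_k - 1 times each
predicted = [1.0] * len(sizes)
for n in sizes:
    predicted += [(d + 1) / ((n - 1) * (n - d - 1))] * (n - 1)
predicted = np.sort(predicted)[::-1]
```

Here the predicted values are 1, 1, then 1/3 three times (cluster of size 4) and 1/6 four times (cluster of size 5).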

Remark 4.3. For the TLSCC algorithm, the corresponding perfect tensor
$\tilde{\mathcal{A}}_L$ is a (d + 1)-way equivalent of the (d + 2)-way tensor
$\tilde{\mathcal{A}}$ of equation (11). Proposition 4.1 still holds for
$\tilde{\mathcal{A}}_L$, but with d replaced by d - 1.

Example 4.4. Illustration of the perfect tensor analysis: We randomly
generate three clean lines in $\mathbb{R}^2$ and then sample 25 points from
each line (see Figure 1(a)). We then apply TSCC with the polar tensor of
equation (6) and $\sigma = 0.00001$. The corresponding tensor is a close
approximation to the perfect tensor, because taking the limit of equation (6)
as $\sigma \to 0^+$ essentially yields the perfect tensor. Intermediate and
final clustering results are reported in Figures 1(b)-1(d).

In this case, the top three eigenvalues are hardly distinguishable from 1, and
the rest are close to zero (see Figure 1(b)). The rows of $\mathbf{U}$
accumulate at three orthogonal vectors (see Figure 1(c)), and thus form three
tight clusters, each representing an underlying line (see Figure 1(d)).

4.2 Perturbation Analysis of TSCC with a General Affinity Tensor

4.2.1 Assumptions

We assume that the underlying clusters have comparable and adequate sizes;
more precisely, there exists a constant $0 < \varepsilon_1 \le 1$ such that

$$N_k \ge \max(\varepsilon_1 \cdot N/K,\ 2d + 2), \quad k = 1, \ldots, K. \tag{12}$$

We also assume that all the affinity tensors $\mathcal{A}$ considered in this
section are super-symmetric, and with elements between 0 and 1. Moreover, they
satisfy the following condition.

Assumption 1. There exists a constant $\varepsilon_2 > 0$ such that

$$\mathbf{D} \ge \varepsilon_2 \tilde{\mathbf{D}}.$$
Figure 1: Illustration of the perfect tensor analysis. Panels: (a) data points, (b) eigenvalues of Z, (c) rows of U, (d) detected clusters.



Remark 4.5. We feel the need to have some lower bound on $\mathbf{D}$, possibly
even weaker than that of Assumption 1, to ensure that the TSCC algorithm would
work well. Indeed, for each $i \in I_k$, $1 \le k \le K$, the sum
$\sum_{j \in I_k} W_{ij}$ measures the "connectedness" between the point
$\mathbf{x}_i$ and the other points in $C_k$, and thus should be sufficiently
large. Accordingly, since $D_{ii} \ge \sum_{j \in I_k} W_{ij}$, $i \in I_k$,
$1 \le k \le K$, these diagonal entries of the matrix $\mathbf{D}$ should be
correspondingly large as well. In Subsection 5.4 we discuss the existence of
this condition for the polar tensor while taking into account the restrictions
on the tuning parameter $\sigma$ implied by Theorem 5.1.



4.2.2 Measuring Goodness of Clustering of the TSCC Algorithm 

We use two equivalent ways to quantify the goodness of clustering of the TSCC
algorithm when applied with a general affinity tensor $\mathcal{A}$. In
Subsection 4.3.1 we relate them to the more absolute notion of clustering
identification error.

We first investigate each of the K underlying clusters in the U space, i.e.,
$\{\mathbf{u}^{(i)}\}_{i \in I_k}$, $1 \le k \le K$, and estimate the sum of
their variances. We refer to this sum as the total variation of the matrix
$\mathbf{U}$.

Definition 4.6. The total variation of $\mathbf{U}$ (with respect to the K
underlying clusters) is

$$\mathrm{TV}(\mathbf{U}) := \sum_{1 \le k \le K} \sum_{i \in I_k} \big\| \mathbf{u}^{(i)} - \mathbf{c}^{(k)} \big\|_2^2, \tag{13}$$

where $\mathbf{c}^{(1)}, \ldots, \mathbf{c}^{(K)}$ are the centers of the
underlying clusters in the U space (see equation (10)).

The smaller the total variation $\mathrm{TV}(\mathbf{U})$ is, the more
concentrated the underlying clusters in the U space are. In fact, the following
lemma (proved in Appendix A.3) implies that the smaller
$\mathrm{TV}(\mathbf{U})$ is, the more separated the centers are from the
origin and from each other.

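Lemma 4.2.

$$\sum_{1 \le k \le K} N_k \cdot \big\| \mathbf{c}^{(k)} \big\|_2^2 = K - \mathrm{TV}(\mathbf{U}), \tag{14}$$

$$\sum_{1 \le k < \ell \le K} N_k N_\ell \cdot \big\langle \mathbf{c}^{(k)}, \mathbf{c}^{(\ell)} \big\rangle^2 \le \mathrm{TV}(\mathbf{U}). \tag{15}$$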
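Both relations of the lemma depend only on U having orthonormal columns, so they can be checked on any such matrix. The NumPy sketch below is an illustration under that assumption: it draws a random N x K matrix with orthonormal columns (not an actual TSCC embedding), computes the cluster centers and the total variation, and evaluates the left-hand sides of the identity for the weighted center norms and of the inequality for the weighted inner products. The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [6, 9, 15]                               # N_1 <= N_2 <= N_3
K, N = len(sizes), sum(sizes)
labels = np.repeat(np.arange(K), sizes)          # points ordered by membership

# Any N x K matrix with orthonormal columns can play the role of U here
U, _ = np.linalg.qr(rng.standard_normal((N, K)))

# Centers c^(k) and total variation TV(U)
centers = np.vstack([U[labels == k].mean(axis=0) for k in range(K)])
tv = sum(np.sum((U[labels == k] - centers[k]) ** 2) for k in range(K))

# sum_k N_k ||c^(k)||^2, expected to equal K - TV(U)
weighted_norms = sum(sizes[k] * (centers[k] @ centers[k]) for k in range(K))

# sum_{k<l} N_k N_l <c^(k), c^(l)>^2, expected to be at most TV(U)
weighted_inner = sum(sizes[k] * sizes[l] * (centers[k] @ centers[l]) ** 2
                     for k in range(K) for l in range(k + 1, K))
```

The identity follows from splitting the squared Frobenius norm of U, which equals K, into a between-cluster part and a within-cluster part.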

The other measurement of the goodness of clustering of TSCC is motivated
by the fact that, in the ideal case, the subspace spanned by the top K
eigenvectors of $\tilde{\mathbf{Z}}$, $E_K(\tilde{\mathbf{Z}})$, leads to a
perfect segmentation (see Proposition 4.1). When given a general affinity
tensor $\mathcal{A}$, the eigenspace $E_K(\mathbf{Z})$ determines the
clustering result of TSCC. We thus suggest measuring the discrepancy between
these two eigenspaces, $E_K(\mathbf{Z})$ and $E_K(\tilde{\mathbf{Z}})$, by
comparing the orthogonal projectors onto them, $P^K(\mathbf{Z})$ and
$P^K(\tilde{\mathbf{Z}})$, in the following way.

Definition 4.7. The distance between the two subspaces $E_K(\mathbf{Z})$ and
$E_K(\tilde{\mathbf{Z}})$ is

$$\mathrm{dist}(E_K(\mathbf{Z}), E_K(\tilde{\mathbf{Z}})) := \big\| P^K(\mathbf{Z}) - P^K(\tilde{\mathbf{Z}}) \big\|_F. \tag{16}$$

A geometric interpretation of the above distance is provided in the following
lemma using the notion of principal angles [10]. We review the definition of
principal angles and also prove Lemma 4.3 in Appendix A.4.

Lemma 4.3. Let $0 \le \theta_1 \le \theta_2 \le \cdots \le \theta_K \le \pi/2$
be the K principal angles between the two subspaces $E_K(\mathbf{Z})$ and
$E_K(\tilde{\mathbf{Z}})$. Then

$$\mathrm{dist}^2(E_K(\mathbf{Z}), E_K(\tilde{\mathbf{Z}})) = 2 \cdot \sum_{k=1}^{K} \sin^2 \theta_k. \tag{17}$$

Finally, we claim that the above two ways of measuring the goodness of
clustering of TSCC are equivalent in the following sense (see proof in
Appendix A.2).

Lemma 4.4.

$$\mathrm{dist}^2(E_K(\mathbf{Z}), E_K(\tilde{\mathbf{Z}})) = 2 \cdot \mathrm{TV}(\mathbf{U}). \tag{18}$$
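The equivalence of the two measurements, and the principal-angle interpretation, can be illustrated numerically. The NumPy sketch below assumes that the ideal eigenspace is spanned by the normalized indicator vectors of the clusters (which is what the perfect-tensor analysis gives) and uses a randomly perturbed orthonormal basis as a stand-in for the top K eigenvectors of Z. It then compares the squared Frobenius distance between the two projectors with twice the total variation and with twice the sum of squared sines of the principal angles. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [5, 7]
K, N = len(sizes), sum(sizes)
labels = np.repeat(np.arange(K), sizes)

# Ideal embedding: column k is the normalized indicator vector of cluster k
U_ideal = np.zeros((N, K))
for k in range(K):
    U_ideal[labels == k, k] = 1.0 / np.sqrt(sizes[k])

# A perturbed orthonormal basis standing in for the top-K eigenvectors of Z
U, _ = np.linalg.qr(U_ideal + 0.1 * rng.standard_normal((N, K)))

# Squared distance between the eigenspaces via their orthogonal projectors
P, P_ideal = U @ U.T, U_ideal @ U_ideal.T
dist_sq = np.linalg.norm(P - P_ideal, "fro") ** 2

# Total variation of U with respect to the underlying clusters
centers = np.vstack([U[labels == k].mean(axis=0) for k in range(K)])
tv = sum(np.sum((U[labels == k] - centers[k]) ** 2) for k in range(K))

# Principal angles: their cosines are the singular values of U_ideal' U
cosines = np.clip(np.linalg.svd(U_ideal.T @ U, compute_uv=False), 0.0, 1.0)
sin_sq_sum = np.sum(1.0 - cosines ** 2)
```

The total variation is invariant under a change of orthonormal basis within the eigenspace, which is why it can be compared with a quantity that depends only on the subspaces.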



4.2.3 The Perturbation Result 

Given a general affinity tensor $\mathcal{A}$ we quantify its deviation from
the perfect tensor $\tilde{\mathcal{A}}$ by the difference

$$\mathcal{E} := \mathcal{A} - \tilde{\mathcal{A}}.$$

Our main result shows that the magnitude of this perturbation controls the
goodness of clustering of the TSCC algorithm.
Theorem 4.5. Let $\mathcal{A}$ be any affinity tensor satisfying Assumption 1
and $\mathcal{E}$ its deviation from the perfect tensor. There exists a
constant $C_1 = C_1(K, d, \varepsilon_1, \varepsilon_2)$ (estimated in
equation (71) of Appendix A.5) such that if

$$N^{-(d+2)} \|\mathcal{E}\|_F^2 \le \frac{1}{8 C_1},$$

then

$$\mathrm{TV}(\mathbf{U}) \le C_1 \cdot N^{-(d+2)} \|\mathcal{E}\|_F^2. \tag{19}$$

Remark 4.8. For the TLSCC algorithm, Theorem 4.5 holds with d replaced
by d - 1.

Example 4.9. Illustration of the perturbation analysis: We corrupt the
data in Figure 1 with 2.5% additive Gaussian noise (see Figure 2(a)), and apply
TSCC with the polar tensor of equation (6) and $\sigma = 0.1840$. In this case
of moderate noise, the top three eigenvalues are still nicely separated from
the rest, even though two of them deviate from 1 (see Figure 2(b)). The rows of
$\mathbf{U}$ still form three clear clusters, but they deviate from
concentrating at exactly three orthogonal vectors (see Figure 2(c)). The
underlying clusters are detected correctly, except possibly for a few points at
their intersection (see Figure 2(d)).
Figure 2: Illustration of the perturbation analysis. Panels: (a) data points, (b) eigenvalues of Z, (c) rows of U, (d) detected clusters.
Figure 3: The underlying clusters and those found by K-means in the U, T, V
spaces. Panels: (a) the underlying clusters in the U, T, V spaces respectively,
(b) the clusters found by K-means in the U, T, V spaces. The given data
consists of 80 and 20 points on two lines in $\mathbb{R}^2$. We note that, in
order for the rows of $\mathbf{U}$ to have similar magnitudes to those of
$\mathbf{T}$ and $\mathbf{V}$, we have scaled each row of $\mathbf{U}$ with the
square root of the average cluster size, i.e., $\sqrt{N/K}$.



4.3 The Effects of the Normalizations in TSCC 

4.3.1 Possible Normalizations of U and Their Effects on Clustering 

The analysis of the previous subsections uses the embedding represented by
the rows of $\mathbf{U}$. It is possible to normalize these rows (e.g., by
their lengths as in [31]) before applying K-means. In the following we consider
two normalized versions of the rows of $\mathbf{U}$, and analyze their effects
on the TSCC algorithm (in comparison with the rows of $\mathbf{U}$).

Using the cluster sizes, or the row lengths, one could normalize the matrix
$\mathbf{U}$ and obtain two matrices $\mathbf{T}$, $\mathbf{V}$ whose rows are
defined as follows:

$$\mathbf{t}^{(i)} := \sqrt{N_k} \cdot \mathbf{u}^{(i)}, \quad i \in I_k,\ 1 \le k \le K, \tag{20}$$

$$\mathbf{v}^{(i)} := \frac{1}{\|\mathbf{u}^{(i)}\|_2} \cdot \mathbf{u}^{(i)}, \quad 1 \le i \le N. \tag{21}$$

These two normalizations are explained as follows. The V normalization discards
all the magnitude information of the rows of $\mathbf{U}$ to contain only the
angular information between them. The T normalization, containing the same
angular information, reduces to $\mathbf{U}$ (up to the common factor
$\sqrt{N/K}$) when $N_1 = \cdots = N_K = N/K$, and otherwise tries to further
separate the underlying clusters by scaling the rows using the cluster
sizes. See Figure 3(a) for an illustration of the U, T, V spaces.
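The two normalizations can be written in a few lines of NumPy. The sketch below assumes that a row of T is the corresponding row of U multiplied by the square root of its cluster size, and that a row of V is the row of U divided by its Euclidean length; the function name `normalize_rows` is hypothetical. It is applied to the ideal embedding (normalized cluster indicator vectors) for clusters of 80 and 20 points, where the two normalizations are expected to agree.

```python
import numpy as np

def normalize_rows(U, labels, sizes):
    """Return (T, V): rows of U scaled by sqrt(N_k), and rows scaled to unit length.

    Assumes no row of U is identically zero (otherwise V is undefined).
    """
    scale = np.sqrt(np.asarray(sizes, dtype=float))[labels]   # sqrt(N_k) per point
    T = U * scale[:, None]
    V = U / np.linalg.norm(U, axis=1, keepdims=True)
    return T, V

sizes = [80, 20]
labels = np.repeat(np.arange(len(sizes)), sizes)

# Ideal embedding: rows in cluster k equal e_k / sqrt(N_k)
U_ideal = np.zeros((sum(sizes), len(sizes)))
for k, n in enumerate(sizes):
    U_ideal[labels == k, k] = 1.0 / np.sqrt(n)

T, V = normalize_rows(U_ideal, labels, sizes)
```

T requires the cluster sizes (or estimates of them), whereas V can always be computed from U alone.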

Remark 4.10. The normalization T assumes knowledge of the underlying cluster
sizes, but can be effectively approximated without this knowledge when using
our practical version of TSCC, i.e., SCC [7]. The SCC algorithm employs an
iterative sampling procedure which converges quickly, thus it can estimate
$\mathbf{T}$ in the current iteration by using the clusters obtained in the
previous iteration.

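We view the matrix $\mathbf{V}$ as a weak approximation to $\mathbf{T}$.
Indeed, in the ideal case they coincide, since for all $1 \le k \le K$,

$$\big\| \tilde{\mathbf{u}}^{(i)} \big\|_2 = \frac{1}{\sqrt{N_k}}, \quad i \in I_k$$

(see equation (48)). In the general case, the above equality only holds on
average. More precisely, the orthonormality of $\mathbf{U}$ implies that

$$\sum_{k=1}^{K} \sum_{i \in I_k} \big\| \mathbf{u}^{(i)} \big\|_2^2 = \|\mathbf{U}\|_F^2 = K.$$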

We next define two criteria for analyzing the performance of $\mathbf{U}$,
$\mathbf{T}$ and $\mathbf{V}$ when directly applying K-means to them.

First, we define a notion of the separation factor for the centers of the
underlying clusters in each of the U, T and V spaces. The separation factor of
the centers in the U space is defined as follows:

$$\beta(\mathbf{U}) := \frac{\sum_{1 \le k < \ell \le K} \big\langle \mathbf{c}^{(k)}, \mathbf{c}^{(\ell)} \big\rangle^2}{\Big( \sum_{1 \le k \le K} \big\| \mathbf{c}^{(k)} \big\|_2^2 \Big)^2}. \tag{22}$$

The separation factors $\beta(\mathbf{T})$, $\beta(\mathbf{V})$ are defined
similarly. The smaller $\beta$ is, the more separated the centers of the
underlying clusters are. Lemma 4.2 directly implies that $\beta(\mathbf{T})$ is
controlled by $\mathrm{TV}(\mathbf{U})$ as follows.



Lemma 4.6.

$$\beta(\mathbf{T}) \le \frac{\mathrm{TV}(\mathbf{U})}{(K - \mathrm{TV}(\mathbf{U}))^2}.$$

We note that $\beta(\mathbf{U}) = \beta(\mathbf{T})$ when $N_k = N/K$,
k = 1, ..., K. In general, we observe that
$\beta(\mathbf{U}) \le \beta(\mathbf{T}) \le \beta(\mathbf{V})$, with the
former two being fairly close. For example, $\beta(\mathbf{U}) = 0.0004$,
$\beta(\mathbf{T}) = 0.0006$, $\beta(\mathbf{V}) = 0.0032$ in Figure 3(a). In
practice, however, we have found that the underlying clusters in the U, T, V
spaces are usually not closely concentrated around their centers, thus this
criterion may not be sufficient.

Second, we define a notion of the clustering identification error in the U, T 
and V spaces respectively. For ease of discussion, we suppose that K = 2. In 
the U space, the corresponding error has the form: 



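\[ e_{\mathrm{id}}(\mathbf{U}) := \frac{1}{N} \sum_{k=1}^{2} \#\left\{ i \in I_k : \|\mathbf{u}^{(i)} - \mathbf{c}^{(k)}\|_2 \ge \frac{1}{2} \|\mathbf{c}^{(1)} - \mathbf{c}^{(2)}\|_2 \right\}. \qquad (23) \]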



The errors in the T, V spaces are defined similarly. The following lemma (proved
in Appendix A.7) shows that both $e_{\mathrm{id}}(\mathbf{T})$ and $e_{\mathrm{id}}(\mathbf{U})$ can be controlled by
TV(U), with the former having a smaller upper bound.

Lemma 4.7. Suppose that K = 2. If

\[ \mathrm{TV}(\mathbf{U}) < (\sqrt{3} - 1)^2, \]

then the identification error in the T space is bounded above as follows:

\[ e_{\mathrm{id}}(\mathbf{T}) \le \frac{4\,\mathrm{TV}(\mathbf{U})}{2 - \mathrm{TV}(\mathbf{U}) - 2\sqrt{\mathrm{TV}(\mathbf{U})}}. \qquad (24) \]

If 

TV(U) < 

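then the identification error in the U space is bounded above as follows:

\[ e_{\mathrm{id}}(\mathbf{U}) \le \frac{4\,\mathrm{TV}(\mathbf{U})}{2 - \mathrm{TV}(\mathbf{U}) - 4/\varepsilon_1 \cdot \sqrt{\mathrm{TV}(\mathbf{U})}}, \qquad (25) \]

where the constant $\varepsilon_1$ is defined in equation (12).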
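
As a short worked step (our own remark, not part of the original statement), the threshold $(\sqrt{3}-1)^2$ in the first condition of Lemma 4.7 is exactly the range on which the denominator of the T-space bound stays positive. Writing $x = \mathrm{TV}(\mathbf{U})$ and $s = \sqrt{x} \ge 0$:

```latex
\[
2 - x - 2\sqrt{x} > 0
\;\Longleftrightarrow\; s^2 + 2s - 2 < 0
\;\Longleftrightarrow\; s < -1 + \sqrt{3}
\;\Longleftrightarrow\; x < (\sqrt{3} - 1)^2 = 4 - 2\sqrt{3} \approx 0.536 .
\]
```

The same reasoning applied to $2 - x - (4/\varepsilon_1)\sqrt{x}$ governs the range of validity of the U-space bound.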

We remark that the clustering identification errors $e_{\mathrm{id}}(\mathbf{U})$, $e_{\mathrm{id}}(\mathbf{T})$, $e_{\mathrm{id}}(\mathbf{V})$
have only theoretical meanings. However, they can be used to estimate the
clustering errors of K-means when applied in the U, T, V spaces respectively.
We observed in practice that $e_{\mathrm{id}}(\mathbf{T})$ and $e_{\mathrm{id}}(\mathbf{V})$ are often very close.

Following the above discussion we think that T is probably the right
normalization to be used in TSCC. Its practical implementation should follow
Remark 4.10. We note that the application of this normalization in Lemma 4.2
results in analogous estimates for the T space which are independent of the sizes
of clusters. Indeed, this normalization seems to outperform U when $N_1, \ldots, N_K$
vary widely (this claim is supported in practice by numerical experiments and
in theory by Lemma 4.7). Another reason for our preference of T is that
performing K-means in the T space is equivalent to performing weighted K-means
(with weights $N_k/N$, $1 \le k \le K$) in the U space, which allows small clusters to
have relatively larger variance (see e.g., Figure 3(a)).

The V normalization is another possibility to use in TSCC. On one hand, it
is a weak approximation to T; on the other hand, it contains only the angular
information of the rows of U. The use of only angular information for K-means
clustering, partly supported by the polarization theorem in [6], seems to also
separate the underlying clusters further. However, we need to understand this
normalization more thoroughly, i.e., in terms of theoretical analysis.

In [7] we have used U to demonstrate our numerical strategies, which also 
apply to T and V, and obtained good numerical results. 



4.3.2 TSCC Without Normalizing W 

We analyze here the TSCC algorithm when the matrix W is not normalized,
i.e., skipping Step 1 of Algorithm 1 and letting Z = W. We refer to the
corresponding variant of TSCC as TSCC-UN, and formulate below analogues
of Proposition 4.1 and Theorem 4.5. The proof of Proposition 4.8 directly
follows that of Proposition 4.1 in Appendix A.1 (in particular, equations (42)
and (43)). Theorem 4.9 is proved in Appendix A.6.

Proposition 4.8. Suppose that the TSCC-UN algorithm is applied with the
perfect tensor $\tilde{\mathcal{A}}$. Then

1. The eigenvalues of $\tilde{\mathbf{W}}$ are $d_K \ge \cdots \ge d_2 \ge d_1$ (each of multiplicity 1), and
$\nu_K \ge \cdots \ge \nu_2 \ge \nu_1$ (of multiplicity $N_K - 1, \ldots, N_2 - 1, N_1 - 1$ respectively), where

\[ d_k := (N_k - d - 1) \cdot P(N_k - 1, d + 1), \qquad (26) \]
\[ \nu_k := (d + 1) \cdot P(N_k - 2, d). \qquad (27) \]

2. If $d_1 > \nu_K$, the rows of $\tilde{\mathbf{U}}$ are exactly K mutually orthogonal vectors, each
representing a distinct underlying cluster.
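
The closed forms (26) and (27) can be compared with a direct eigenvalue computation on a single block. The sketch below is illustrative only: the block follows equation (43), the helper names and the values of the cluster size and flat dimension are hypothetical, and `math.perm` (Python 3.8+) plays the role of P(n, r).

```python
import math
import numpy as np

def perfect_block(n_k, d):
    """Block W^(k) of equation (43): P(n_k-1, d+1) on the diagonal, P(n_k-2, d+1) elsewhere."""
    diag = math.perm(n_k - 1, d + 1)
    off = math.perm(n_k - 2, d + 1)
    return np.full((n_k, n_k), float(off)) + (diag - off) * np.eye(n_k)

def closed_form(n_k, d):
    d_k = (n_k - d - 1) * math.perm(n_k - 1, d + 1)   # equation (26)
    nu_k = (d + 1) * math.perm(n_k - 2, d)            # equation (27)
    return d_k, nu_k

n_k, d = 8, 2                                          # hypothetical values
eigs = np.sort(np.linalg.eigvalsh(perfect_block(n_k, d)))
d_k, nu_k = closed_form(n_k, d)
```

For this block the largest computed eigenvalue matches $d_k$ and the remaining $n_k - 1$ eigenvalues all match $\nu_k$.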

Theorem 4.9. Suppose that TSCC-UN is applied with a general affinity tensor 
A, and that 

K-l V /2kV^^ 



Let 



7V>^/2(rf+l)(l--^£i 1 ( — 1 , (28) 



\[ C_2(K, d, \varepsilon_1, \varepsilon_2) := 32 \left( \frac{2K}{\varepsilon_1} \right)^{2(d+2)}. \]



If 



then

\[ \mathrm{TV}(\mathbf{U}) \le C_2 \cdot N^{-(d+2)} \|\mathbf{E}\|_F^2. \qquad (29) \]



In view of equation (28), the TSCC-UN algorithm seems to require large 
data size in order to work well. Numerical experiments also indicate that this 
approach is very sensitive to the variation of cluster sizes, and works consistently 
worse than the normalized approach, i.e., TSCC. Our current analysis, however, 
does not manifest the significant advantage of the normalized approach. We thus 
leave the related exploration to later research. 

Von Luxburg et al. [40] have shown that in the framework of kernel spectral
clustering, the normalized method is consistent under very general conditions.
On the other hand, the unnormalized method is only consistent under very
specific conditions that are rarely met in practice. Since W can be seen as a kernel
matrix, [40] provides further evidence for our preference of the normalized
approach.

5 Probabilistic Analysis of TSCC 

In this section we analyze the performance of the TSCC algorithm with its 
own affinity tensor, i.e., the polar tensor of equation (6). We control with 
high probability (with respect to the sampling in Problem 1) the goodness of 
clustering of TSCC when applied to the data generated in Problem 1. 

5.1 Basic Setting and Definitions 

We follow the setting of hybrid linear modeling described in Problem 1 together
with the assumptions of regularity and possibly d-separation of $\{\mu_k\}_{k=1}^{K}$ (see
Remark 2.5) as well as the restriction imposed by equation (12). We denote
the corresponding N random variables by $X_1, \ldots, X_N \in \mathbb{R}^D$ and maintain the
previous notation for their sampled values $x_1, \ldots, x_N$. The joint sample space
is $(\mathbb{R}^D)^N$, and the corresponding joint probability measure is

\[ \mu_p := \mu_1^{N_1} \times \cdots \times \mu_K^{N_K}. \qquad (30) \]

We introduce an incidence constant reflecting the separation between the
measures $\mu_1, \ldots, \mu_K$ in regard to the polar curvature $c_p$ and the tuning
parameter $\sigma$. We first define the following sets

\[ S_k := \{\mathrm{supp}(\mu_k)\}^{d+2}, \quad 1 \le k \le K. \]

Then, given a constant $\sigma > 0$, the incidence constant has the form:
Cin(Mi, . . . , HK;<y) '■= 

,^ max / •••/ e^^^-*^^-^^d/Xfci(zi)... d/Xfed+2(zd+2), (31) 

not all equal **+^ 

where the maximum is taken over all $1 \le k_1, \ldots, k_{d+2} \le K$ except
$k_1 = k_2 = \cdots = k_{d+2}$.

Remark 5.1. For TLSCC, the incidence constant is defined as follows:
Cin,L(/Ul, • • • , IJ-K]cr) : = 

f f -Cp (0,zi ,...,z^_)_i) 

max / •••/ e - d/Xfe^ (zi) . . . d/ifc^+j (z^+i). (32) 



l<ki,...,ka^i<K Jo. Ja 



not all equal ' ""^ ' "'°<*+i 

We note that for both TSCC and TLSCC, the incidence constant is between
0 and 1. The smaller the incidence constant is, the more separated (in terms of the
polar curvature and the tuning parameter) the measures are. In Subsection 5.3 
we estimate the incidence constant in a few special instances of hybrid linear 
modeling. 

5.2 The Main Result 

The following theorem (proved in Appendix A.9) shows that, when the underlying
measures are sufficiently flat and well separated from each other, with high
probability (with respect to the sampling of Problem 1) the TSCC algorithm
segments the K underlying clusters well.

Theorem 5.1. Suppose that the TSCC algorithm is applied to the data generated
in Problem 1 with a tuning parameter $\sigma > 0$. Let



K 

1 

a : 



1 ^ 

^ clink) + CUnu ...,Hk; (t/2), (33) 



a 

k=l 



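and $C_1 = C_1(K, d, \varepsilon_1, \varepsilon_2)$ be the constant defined in Theorem 4.5. If

\[ \alpha < \frac{1}{16\,C_1}, \]

then

\[ \mu_p\left( \mathrm{TV}(\mathbf{U}) \le 2\alpha \cdot C_1 \mid \text{Assumption 1 holds} \right) \ge 1 - e^{-2N\alpha^2/(d+2)^2}. \qquad (34) \]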
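
To get a feeling for how the sample size N enters the guarantee, the two quantities in (34) can be evaluated directly. This is a minimal sketch under stated assumptions: the function name is ours, and the numerical values of $\alpha$ and $C_1$ below are purely hypothetical placeholders (the actual $C_1$ of Theorem 4.5 depends on K, d, $\varepsilon_1$, $\varepsilon_2$ and is not computed here).

```python
import math

def theorem_5_1_bounds(alpha, c1, n, d):
    """Return (bound on TV(U), lower bound on the probability) as read from (34).

    Returns None when the hypothesis alpha < 1/(16*C_1) fails.
    """
    if not alpha < 1.0 / (16.0 * c1):
        return None
    tv_bound = 2.0 * alpha * c1
    prob_lower = 1.0 - math.exp(-2.0 * n * alpha ** 2 / (d + 2) ** 2)
    return tv_bound, prob_lower

# Hypothetical inputs, chosen only so that the hypothesis is satisfied.
example = theorem_5_1_bounds(alpha=0.01, c1=5.0, n=10**6, d=2)
```

For fixed $\alpha$ the probability bound approaches 1 as N grows, but it is uninformative when $N\alpha^2$ is small, which is why the statement is a large-sample guarantee.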

Remark 5.2. Theorem 5.1 also holds for the TLSCC algorithm, but with d
replaced by d − 1, and the constant $\alpha$ by



1 ^ 

:= ^ X] 4,L('"fc) + C'in,L(Mi, (t/2), (35) 



where for any Borel probability measure $\mu$,



Cp,l(m) •= Y y c2(0,Zi,...,Zd+i)d/x(zi)... d/x(zd+i). 

Remark 5.3. A similar version of Theorem 5.1 holds for general affinity tensors
of the form $\{e^{-c(x_{i_1}, \ldots, x_{i_{d+2}})/\sigma}\}_{1 \le i_1, \ldots, i_{d+2} \le N}$, where c is a nonnegative, symmetric
function defined on $(\mathbb{R}^D)^{d+2}$. The significance of using the polar curvature, or
any other curvature satisfying Theorem 2.1, is explained in Subsection 5.3.

We showed in Lemma 4.7 that the clustering identification errors eid(U) and 
eid(T) can be controlled by TV(U) when K = 2. Combining Lemma 4.7 and 
Theorem 5.1 yields the following probabilistic statement. 

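Corollary 5.2. Suppose that K = 2, and that $\alpha$, $C_1$ are the constants defined
in Theorem 5.1. If

then

\[ \mu_p\left( e_{\mathrm{id}}(\mathbf{T}) \le \frac{4\alpha C_1}{1 - \alpha C_1 - \sqrt{2\alpha C_1}} \;\middle|\; \text{Assumption 1 holds} \right) \ge 1 - e^{-2N\alpha^2/(d+2)^2}. \]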

If 



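then

\[ \mu_p\left( e_{\mathrm{id}}(\mathbf{U}) \le \frac{4\alpha C_1}{1 - \alpha C_1 - 2/\varepsilon_1 \cdot \sqrt{2\alpha C_1}} \;\middle|\; \text{Assumption 1 holds} \right) \ge 1 - e^{-2N\alpha^2/(d+2)^2}. \]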
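
For the reader's convenience, the substitution behind the T-space bound of Corollary 5.2 can be written out. This is our own worked step: it uses only the bound (24) of Lemma 4.7, the event $\mathrm{TV}(\mathbf{U}) \le 2\alpha C_1$ of Theorem 5.1, and the fact that $x \mapsto 4x/(2 - x - 2\sqrt{x})$ is increasing on $[0, (\sqrt{3}-1)^2)$.

```latex
\[
e_{\mathrm{id}}(\mathbf{T})
\;\le\; \frac{4\,\mathrm{TV}(\mathbf{U})}{2 - \mathrm{TV}(\mathbf{U}) - 2\sqrt{\mathrm{TV}(\mathbf{U})}}
\;\le\; \frac{4 \cdot 2\alpha C_1}{2 - 2\alpha C_1 - 2\sqrt{2\alpha C_1}}
\;=\; \frac{4\alpha C_1}{1 - \alpha C_1 - \sqrt{2\alpha C_1}} .
\]
```

Under the hypothesis $\alpha < 1/(16 C_1)$ of Theorem 5.1 one has $2\alpha C_1 < 1/8 < (\sqrt{3}-1)^2$, so the monotonicity step is legitimate on that event. The U-space bound follows in the same way from (25).
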

5.3 Interpretation of the Constant α

Theorem 5.1 shows the strong effect of the constant $\alpha$ on the goodness of
clustering of the TSCC algorithm. This constant has two parts, which are explained
respectively as follows.

Theorem 2.1 implies that the first part of $\alpha$ is comparable to




1 ^ 



fe=l 



We thus view the first part as the sum of the within-cluster errors of the model
scaled by $\sigma^2$.






Remark 5.4. A similar interpretation applies to the tensors defined in
equation (8). In this case, for any $q \ge 1$, the first term of $\alpha$ is replaced by

fc=i 

where for any Borel probability measure $\mu$,




Zd+2) d/x(zi) . . . dii{zd+2)- 



The above sum is then comparable to 

1 ^ 
fe=l 

where $e_{2q}(\mu_k)$ is the error of approximating $\mu_k$ by a d-flat while minimizing the
$L_{2q}$ norm [23].

We interpret the second part of $\alpha$, i.e., the incidence constant, as the
between-clusters interaction of the model. Unlike the first part, we do not have a
theoretical result that fully establishes this interpretation. We show in a few special
cases (with underlying linear subspaces) how to control this constant.

In the first example (Example 5.5) we estimate the incidence constant for
two orthogonal line segments when using TSCC. The next three examples
assume the use of the TLSCC algorithm. In Example 5.6 the model includes
distributions along two clean line segments with an arbitrary angle $\theta$ between
them. We establish the dependence of the incidence constant on $\theta$ and $\sigma$. In
Example 5.7 we consider two orthogonal lines with uniform noise around them,
and demonstrate the dependence of the incidence constant on the level of the
noise and $\sigma$. Example 5.8 considers two clean orthogonal planes in $\mathbb{R}^3$.

Example 5.5. (TSCC: two orthogonal clean lines). We consider the
following two orthogonal line segments in $\mathbb{R}^2$:

L1 : $y = 0$, $0 \le x \le L$,

and

L2 : $x = 0$, $0 \le y \le L$,

in which $L > 0$ is a fixed constant. We assume arclength measures $\mu_1 = \mu_2 = \frac{1}{L}\mathcal{L}_1$
supported on L1 and L2 respectively. For any $\sigma > 0$, the incidence constant
for TSCC is bounded as follows (see Appendix A.10):

an(/xi,/X2;a)<-^(l-e-^^^/'^). (36) 

Example 5.6. (TLSCC: two intersecting clean lines). We consider the
following two lines in $\mathbb{R}^2$:

L1 : $y = 0$, $0 \le x \le L$,

and

L2 : $y = r \sin\theta$, $x = r \cos\theta$, $0 \le r \le L$,

in which $L > 0$ and $0 < \theta \le \pi/2$ are fixed constants. We assume arclength
measures $\mu_1 = \mu_2 = \frac{1}{L}\mathcal{L}_1$ supported on L1 and L2 respectively. For any $\sigma > 0$,
the incidence constant for TLSCC is bounded as follows (see Appendix A.11):

^ , X „/ c \2 LsiaB / Lsin6'\\ , 

an,L(m,M2;a)<2(^) .^l-e-^(l + ^jj. (37) 

We note that when $\theta = \pi/2$, $C_{\mathrm{in},L}$ has a faster decay rate than $C_{\mathrm{in}}$ (see
Example 5.5).

Example 5.7. (TLSCC: two orthogonal rectangles). We consider two
rectangular strips in $\mathbb{R}^2$ determined by the following vertices respectively:

R1 : $(\epsilon, 0), (L + \epsilon, 0), (\epsilon, \epsilon), (L + \epsilon, \epsilon)$,

and

R2 : $(0, \epsilon), (0, L + \epsilon), (\epsilon, \epsilon), (\epsilon, L + \epsilon)$,

in which $0 < \epsilon \ll L$. We assume uniform measures $\mu_i = \frac{1}{\epsilon L}\mathcal{L}_2$ restricted to $R_i$,
i = 1, 2. We view R1 and R2 as two lines surrounded by uniform noise. Let
$\omega := L/\epsilon$. For any $\sigma > 0$, the incidence constant for TLSCC has the following
upper bound (see Appendix A.12)



an,L(/.r,M2;a)<4 + ^.e-V(^<^^^^)+e- 



1/' 



(38) 



In the limiting case of $\epsilon \to 0+$, i.e., when having two orthogonal lines with
practically no noise, the above estimate decays faster to zero than the one in
Example 5.6 with $\theta = \pi/2$. This is due to the fact that in the current example
we exclude the intersection of the two lines for any $\epsilon > 0$. As it turned out,
the limit of the corresponding integral (as $\epsilon \to 0+$) is not the same as the full
integral of this limit.

Example 5.8. (TLSCC: two perpendicular clean half-disks). We
consider the following portions of two unit disks (in polar coordinates) in $\mathbb{R}^3$:

D1 : $x = 0$, $y = \rho \cos\varphi$, $z = \rho \sin\varphi$, $0 \le \rho \le 1$, $0 \le \varphi \le \pi$,

and

D2 : $x = r \cos\theta$, $y = r \sin\theta$, $z = 0$, $0 \le r \le 1$, $-\pi/2 \le \theta \le \pi/2$.

We also assume uniform measures $\mu_i = \frac{2}{\pi}\mathcal{L}_2$ restricted on $D_i$, i = 1, 2. In
this case, the incidence constant for TLSCC is bounded above by the following
quantity (see Appendix A.13)

G.,L(/.i,M2;a) < ^ + ^ + j-^,. (39) 

5.4 On the Existence of Assumption 1 

The theory developed in this paper assumes that all affinity tensors used with
TSCC, in particular the polar tensor, satisfy Assumption 1. We present some
partial results regarding the existence of this assumption for the polar tensor
while taking into account the restrictions on the size of $\sigma$ imposed by
Theorem 5.1. We remark that those results also extend to some other tensors.

We first show in the following lemma (proved in Appendix A.8.1) that if the
data is sampled from a hybrid linear model without noise, then Assumption 1
is always satisfied with the constant $\varepsilon_2 = 1$.

Lemma 5.3. If TSCC is applied to data sampled from a mixture of clean
d-flats, then

\[ \mathbf{D} \ge \tilde{\mathbf{D}}. \]

For more general data sampled from a hybrid linear model with respect to 
a ball B (according to Problem 1), one can easily obtain that Assumption 1 is 
satisfied with 

(see proof in Appendix A.8.2). However, since our main estimates depend
inversely on $\varepsilon_2$ we would need to have the constant $\varepsilon_2$ sufficiently close to 1, and
thus the above equation implies a lower bound on $\sigma$ of the order of diam(B).
On the other hand, the first term of the constant $\alpha$ stated in Theorem 5.1
implies an upper bound for $\sigma$ of the order of $\sum_{k=1}^{K} c_p(\mu_k)$. These two bounds are
rather contradictory (it is easy to see this in view of the interpretation of the
sum $\sum_{k=1}^{K} c_p(\mu_k)$ in Subsection 5.3).

To resolve the above issue we can replace diam(B) in equation (40) with the
term $\sum_{k=1}^{K} c_p(\mu_k)$ and obtain the following estimate in expectation (see proof
in Appendix A.8.3).

Lemma 5.4. If TSCC is applied to data sampled according to Problem 1,
then Assumption 1 holds in expectation in the following sense:

\[ E_{\mu_p}(\mathbf{D}) \ge \varepsilon_2 \cdot \tilde{\mathbf{D}}, \]

where 

Remark 5.9. We do not expect Assumption 1 to hold with high probability
(i.e., having the measure $\mu_p$ close to one) while maintaining the constant $\varepsilon_2$
formulated in Lemma 5.4. However, it seems reasonable to have a statement in
high probability when replacing the polar curvatures $c_p(\mu_k)$ used in defining this
constant with the following upper bounds:

Cp(Mfe) = max / Cp(zi,Z2, . . . ,Zd+2) d/ife(z2) . . . d/ife(zd+2) . 

ziesupp(/*fe) J 

We leave the investigation of such a statement and the effect of using Cp{ii) 
instead of Cp (/x) to future research. 

6 Conclusion and Future Work 

We have analyzed the performance of TSCC in the setting of hybrid linear
modeling. We first showed that we could precisely cluster the underlying
components knowing the perfect tensor, and then established good performance in
the case of reasonable deviation from the perfect case. Using this result, we
proved that if a data set is sampled independently and identically according
to the setting of Problem 1, then with high sampling probability the TSCC
algorithm will perform well as long as the underlying distributions are
sufficiently flat and separated. In [7] we develop a practical version of the TSCC
algorithm by incorporating different numerical techniques and exemplify its
successful performance using a number of artificial data sets and several real-world
applications.

We conclude this paper by discussing both the open directions and the pos- 
sible extensions of this work. 

Further understanding of the two normalizations discussed in
Subsection 4.3.1: We first explored in Subsection 4.3.1 possible normalizations of the
matrix U, and analyzed (to some extent) the performance of TSCC with and
without them. We concluded that the normalization suggested by the matrix
T is probably the right one to apply in TSCC. It will be interesting to test
our practical strategy when applying such a normalization (see Remark 4.10)
on both artificial and practical data sets with varying numbers of points within
each cluster. Also, we wish to study more carefully the possible advantages of
the normalization suggested by the matrix V.

Lastly, Subsection 4.3.2 analyzed the TSCC algorithm when applied
without normalizing the matrix W. The perturbation results there were practically
comparable to those obtained when applying TSCC with the normalized matrix
Z. They thus did not reveal the significant advantage of using Z. In future
investigations we would like to improve the current estimates so that they emphasize
this significant advantage.

Further interpretation of the incidence constant: Currently we have
described the behavior of the incidence constant in a few typical examples of
two intersecting linear subspaces. We ask about characterization of this constant
for general mixtures of flats, and its dependence on the separation between the
subspaces, the magnitude of noise as well as the tuning parameter.

Estimation of the clustering identification error: We showed in
Subsection 4.3.1 that when K = 2 and TV(U) is sufficiently small, then a large
percentage of the points can be clustered correctly. We would like to extend the
corresponding analysis to the case where K > 2.

Further investigation of Assumption 1: Assumption 1 is a crucial condition
for Algorithm 1 to work well. Our partial results (i.e., Lemmas 5.3 and 5.4)
showed that this assumption holds at least in expectation. We would like to
explore the existence in high probability of Assumption 1 with a constant $\varepsilon_2 > 0$
that does not contradict the bounds imposed by Theorem 5.1 (see discussion in
Section 5.4, in particular, Remark 5.9).

Analysis of other frameworks for multi-way clustering: Agarwal et al. [2]
and Shashua et al. [32] suggested different frameworks for multi-way spectral
clustering. It will be interesting to analyze the performance of their algorithms
when applied to data sampled from a hybrid linear model.

Clustering flats in non-flat spaces, and even more general shapes:
We are interested in generalizing the problem of clustering d-flats in Euclidean
spaces to more general metric spaces where d-flats are replaced by d-dimensional
geodesic surfaces. Also, we would like to modify our curvatures to cluster other
shapes, e.g., circles, parabolas, spheres.

Detecting d-flats: We believe that it is possible to modify the methods
described in this paper to detect an unknown d-flat in uniformly distributed
background noise. If we are able to develop good curvatures for other geometric
shapes, then we can generalize the detection problem to include such shapes
(see [3] and references therein for other solutions to this problem).

A Proofs

A.1 Proof of Proposition 4.1

The affinity matrix $\tilde{\mathbf{A}}$, the matricized version of $\tilde{\mathcal{A}}$, is a 0/1 matrix of size
$N \times N^{d+1}$. We identify the unit entries in each row as follows. For any fixed
$1 \le i \le N_1$, the entries of the $i$-th row of $\tilde{\mathbf{A}}$ are of the form $\tilde{\mathcal{A}}(i, i_2, \ldots, i_{d+2})$,
$1 \le i_2, \ldots, i_{d+2} \le N$. These entries will be 1 if they represent affinities of distinct
d + 2 points in $C_1$, that is, the indices $i, i_2, \ldots, i_{d+2}$ are distinct and between
1 and $N_1$. Therefore, the $i$-th row has exactly $P(N_1 - 1, d + 1)$ entries filled
by a 1, which is exactly the number of permutations of size d + 1 out of the
first $N_1$ points excluding i. Similarly, each of the subsequent $N_2$ rows has
$P(N_2 - 1, d + 1)$ ones, and each of the next $N_3$ rows has $P(N_3 - 1, d + 1)$ ones,
etc.

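The weight matrix $\tilde{\mathbf{W}} = \tilde{\mathbf{A}}\tilde{\mathbf{A}}'$ can be expressed in terms of the tensor $\tilde{\mathcal{A}}$ in
the following way:

\[ \tilde{W}_{ij} = \sum_{1 \le i_2, \ldots, i_{d+2} \le N} \tilde{\mathcal{A}}(i, i_2, \ldots, i_{d+2})\, \tilde{\mathcal{A}}(j, i_2, \ldots, i_{d+2}), \quad 1 \le i, j \le N. \qquad (41) \]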
If $x_i$ and $x_j$ are not in the same underlying cluster, then all the products are
zero. Therefore, $\tilde{\mathbf{W}}$ is block diagonal:

\[ \tilde{\mathbf{W}} = \mathrm{diag}\{\tilde{\mathbf{W}}^{(1)}, \tilde{\mathbf{W}}^{(2)}, \ldots, \tilde{\mathbf{W}}^{(K)}\}, \qquad (42) \]

where $\tilde{\mathbf{W}}^{(k)} \in \mathbb{R}^{N_k \times N_k}$, corresponding to the underlying cluster $C_k$, has the
following form:

\[ \tilde{W}^{(k)}_{ij} = \begin{cases} P(N_k - 1, d + 1), & \text{if } i = j; \\ P(N_k - 2, d + 1), & \text{otherwise.} \end{cases} \qquad (43) \]

Indeed, the diagonal elements of $\tilde{\mathbf{W}}^{(k)}$ are simply the number of ones in the
corresponding rows of $\tilde{\mathbf{A}}$, and the off-diagonal elements are the number of ones
that appear at the intersection of the corresponding pair of rows.
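
The two counts in equation (43) can be confirmed by brute force for a small cluster. The sketch below is illustrative only: the cluster, the flat dimension and the helper name are hypothetical, and the perfect tensor is taken to be 1 exactly on tuples of d + 2 distinct points of the same cluster, as described in the proof.

```python
import math
from itertools import permutations

def unit_entries(i, cluster, d):
    """Index tuples (i_2, ..., i_{d+2}) at which row i of the perfect affinity matrix equals 1."""
    others = [p for p in cluster if p != i]
    return set(permutations(others, d + 1))

cluster = list(range(6))   # hypothetical cluster with N_k = 6 points
d = 1                      # hypothetical flat dimension
rows = {i: unit_entries(i, cluster, d) for i in cluster}

diag_count = len(rows[0])             # diagonal entry: number of ones in row 0
off_count = len(rows[0] & rows[1])    # off-diagonal entry: ones shared by rows 0 and 1
```

Here `diag_count` agrees with `math.perm(5, 2)` and `off_count` with `math.perm(4, 2)`, i.e., with $P(N_k - 1, d + 1)$ and $P(N_k - 2, d + 1)$ for $N_k = 6$, $d = 1$.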
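It then follows that

\[ \tilde{\mathbf{D}} = \mathrm{diag}\{\tilde{\mathbf{W}} \cdot \mathbf{1}\} = \mathrm{diag}\{d_1 \mathbf{1}_{N_1}, d_2 \mathbf{1}_{N_2}, \ldots, d_K \mathbf{1}_{N_K}\}, \qquad (44) \]

where

\[ d_k = P(N_k - 1, d + 1) + (N_k - 1) \cdot P(N_k - 2, d + 1) = (N_k - d - 1) \cdot P(N_k - 1, d + 1). \]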
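
The last equality for $d_k$ is stated without its intermediate step; a short derivation (our own, using only $P(n, r) = n!/(n - r)!$, hence $P(n, r + 1) = P(n, r)\,(n - r)$ and $P(n + 1, r + 1) = (n + 1)\,P(n, r)$) is:

```latex
\begin{align*}
P(N_k - 1, d + 1) + (N_k - 1)\, P(N_k - 2, d + 1)
  &= (N_k - 1)\, P(N_k - 2, d) + (N_k - 1)(N_k - d - 2)\, P(N_k - 2, d) \\
  &= (N_k - d - 1)\,(N_k - 1)\, P(N_k - 2, d) \\
  &= (N_k - d - 1)\, P(N_k - 1, d + 1).
\end{align*}
```
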

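The normalized matrix $\tilde{\mathbf{Z}} = \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{W}}\tilde{\mathbf{D}}^{-1/2}$ is also block diagonal:

\[ \tilde{\mathbf{Z}} = \mathrm{diag}\{\tilde{\mathbf{Z}}^{(1)}, \tilde{\mathbf{Z}}^{(2)}, \ldots, \tilde{\mathbf{Z}}^{(K)}\}, \qquad (45) \]

where each block has the form $\tilde{\mathbf{Z}}^{(k)} = \tilde{\mathbf{W}}^{(k)}/d_k$, $1 \le k \le K$. The (i, j)-element
of $\tilde{\mathbf{Z}}^{(k)}$, for all $1 \le i, j \le N_k$, is

\[ \tilde{Z}^{(k)}_{ij} = \begin{cases} \dfrac{1}{N_k - d - 1}, & \text{if } i = j; \\[2mm] \dfrac{N_k - d - 2}{(N_k - 1)(N_k - d - 1)}, & \text{otherwise.} \end{cases} \qquad (46) \]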



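Straightforward calculation shows that each block matrix $\tilde{\mathbf{Z}}^{(k)}$ has two
distinct eigenvalues:

\[ \lambda_n(\tilde{\mathbf{Z}}^{(k)}) = \begin{cases} 1, & \text{if } n = 1; \\[1mm] \dfrac{d + 1}{(N_k - 1)(N_k - d - 1)}, & \text{if } 2 \le n \le N_k. \end{cases} \qquad (47) \]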



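The eigenspace associated with the single eigenvalue 1 for $\tilde{\mathbf{Z}}^{(k)}$ is spanned
by $\mathbf{1}_{N_k}$, the $N_k$-dimensional column vector of all ones. Since the eigenvalues
and eigenvectors of a block diagonal matrix are essentially the union of those of
its blocks (for eigenvectors we need to append zeros in an appropriate way), we
conclude that $\tilde{\mathbf{Z}}$ has the largest eigenvalue 1 of multiplicity K with associated
eigenspace spanned by the following K orthonormal vectors:

\[ \frac{1}{\sqrt{N_1}} \begin{pmatrix} \mathbf{1}_{N_1} \\ \mathbf{0} \\ \vdots \\ \mathbf{0} \end{pmatrix}, \quad \frac{1}{\sqrt{N_2}} \begin{pmatrix} \mathbf{0} \\ \mathbf{1}_{N_2} \\ \vdots \\ \mathbf{0} \end{pmatrix}, \quad \ldots, \quad \frac{1}{\sqrt{N_K}} \begin{pmatrix} \mathbf{0} \\ \vdots \\ \mathbf{0} \\ \mathbf{1}_{N_K} \end{pmatrix}. \]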
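We note that the K eigenvectors associated with the eigenvalue 1 can only
be determined up to an orthonormal transformation. That is,

\[ \tilde{\mathbf{U}} = \begin{pmatrix} \frac{1}{\sqrt{N_1}} \mathbf{1}_{N_1} & & \\ & \ddots & \\ & & \frac{1}{\sqrt{N_K}} \mathbf{1}_{N_K} \end{pmatrix} \mathbf{Q} \in \mathbb{R}^{N \times K}, \qquad (48) \]

where Q is a K × K orthonormal matrix.

If we write $\mathbf{Q} = (\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_K)'$, where $\mathbf{q}_k$ is the $k$-th column of $\mathbf{Q}'$, then
equation (48) implies that the K clusters are mapped one-to-one to the K
mutually orthogonal vectors $\frac{1}{\sqrt{N_1}} \cdot \mathbf{q}_1, \ldots, \frac{1}{\sqrt{N_K}} \cdot \mathbf{q}_K \in \mathbb{R}^K$.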



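A.2 Proof of Lemma 4.4

We first note that $P^K(\mathbf{Z}) = \mathbf{U}\mathbf{U}'$ and $P^K(\tilde{\mathbf{Z}}) = \tilde{\mathbf{U}}\tilde{\mathbf{U}}'$, due to the fact that both
$\mathbf{U}$ and $\tilde{\mathbf{U}}$ are composed of orthonormal columns. Therefore,

\[ \|P^K(\mathbf{Z}) - P^K(\tilde{\mathbf{Z}})\|_F^2 = \|\mathbf{U}\mathbf{U}' - \tilde{\mathbf{U}}\tilde{\mathbf{U}}'\|_F^2 = \mathrm{trace}\left( \mathbf{U}\mathbf{U}' - \mathbf{U}\mathbf{U}'\tilde{\mathbf{U}}\tilde{\mathbf{U}}' - \tilde{\mathbf{U}}\tilde{\mathbf{U}}'\mathbf{U}\mathbf{U}' + \tilde{\mathbf{U}}\tilde{\mathbf{U}}' \right). \]

Since

\[ \mathrm{trace}(\mathbf{U}\mathbf{U}') = \mathrm{trace}(\mathbf{U}'\mathbf{U}) = \mathrm{trace}(\mathbf{I}_K) = K, \]

and similarly $\mathrm{trace}(\tilde{\mathbf{U}}\tilde{\mathbf{U}}') = K$, we have

\[ \|P^K(\mathbf{Z}) - P^K(\tilde{\mathbf{Z}})\|_F^2 = 2K - 2 \cdot \mathrm{trace}(\mathbf{U}\mathbf{U}'\tilde{\mathbf{U}}\tilde{\mathbf{U}}'). \]

In the formula of the matrix $\tilde{\mathbf{U}}$ (equation (48)), there is an arbitrary
orthonormal matrix Q. However, the product $\tilde{\mathbf{U}}\tilde{\mathbf{U}}'$ does not depend on Q. Hence,
we can use a representation of $\tilde{\mathbf{U}}$ where Q is the identity matrix, and proceed
as follows:

\begin{align*}
\|P^K(\mathbf{Z}) - P^K(\tilde{\mathbf{Z}})\|_F^2 &= 2K - 2 \cdot \|\tilde{\mathbf{U}}'\mathbf{U}\|_F^2 \\
&= 2K - 2 \cdot \sum_{k=1}^{K} \Big\| \frac{1}{\sqrt{N_k}} \sum_{i \in I_k} \mathbf{u}^{(i)} \Big\|_2^2 \\
&= 2K - 2 \cdot \sum_{k=1}^{K} N_k \|\mathbf{c}^{(k)}\|_2^2. \qquad (49)
\end{align*}

Since the columns of the matrix U are unit vectors, we have

\[ \sum_{i=1}^{N} \|\mathbf{u}^{(i)}\|_2^2 = \|\mathbf{U}\|_F^2 = \sum_{k=1}^{K} \|\mathbf{u}_k\|_2^2 = K. \qquad (50) \]

Combining the last two equations we get that

\begin{align*}
\|P^K(\mathbf{Z}) - P^K(\tilde{\mathbf{Z}})\|_F^2 &= 2 \cdot \Big( \sum_{i=1}^{N} \|\mathbf{u}^{(i)}\|_2^2 - \sum_{k=1}^{K} N_k \|\mathbf{c}^{(k)}\|_2^2 \Big) \\
&= 2 \cdot \sum_{k=1}^{K} \Big( \sum_{i \in I_k} \|\mathbf{u}^{(i)}\|_2^2 - N_k \|\mathbf{c}^{(k)}\|_2^2 \Big) \\
&= 2 \cdot \sum_{k=1}^{K} \sum_{i \in I_k} \|\mathbf{u}^{(i)} - \mathbf{c}^{(k)}\|_2^2. \qquad (51)
\end{align*}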



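A.3 Proof of Lemma 4.2

Equation (14) is a direct consequence of combining equation (49) and Lemma 4.4.
To show equation (15), we first expand the following two products

\[ \mathbf{U}\mathbf{U}' = \left( \langle \mathbf{u}^{(i)}, \mathbf{u}^{(j)} \rangle \right)_{1 \le i, j \le N}, \qquad (52) \]
\[ \tilde{\mathbf{U}}\tilde{\mathbf{U}}' = \mathrm{diag}\Big\{ \frac{1}{N_1} \mathbf{1}_{N_1 \times N_1}, \ldots, \frac{1}{N_K} \mathbf{1}_{N_K \times N_K} \Big\}. \qquad (53) \]

Then

\begin{align*}
\|P^K(\mathbf{Z}) - P^K(\tilde{\mathbf{Z}})\|_F^2 &= \|\mathbf{U}\mathbf{U}' - \tilde{\mathbf{U}}\tilde{\mathbf{U}}'\|_F^2 \\
&= \sum_{1 \le k \le K} \sum_{i, j \in I_k} \Big( \langle \mathbf{u}^{(i)}, \mathbf{u}^{(j)} \rangle - \frac{1}{N_k} \Big)^2 + \sum_{1 \le k \ne \ell \le K} \sum_{i \in I_k, j \in I_\ell} \langle \mathbf{u}^{(i)}, \mathbf{u}^{(j)} \rangle^2 \\
&\ge \sum_{1 \le k \ne \ell \le K} \sum_{i \in I_k, j \in I_\ell} \langle \mathbf{u}^{(i)}, \mathbf{u}^{(j)} \rangle^2. \qquad (54)
\end{align*}

We next apply the inequality $(\sum_{j=1}^{m} a_j)^2 \le m \cdot \sum_{j=1}^{m} a_j^2$ and conclude that

\[ \|P^K(\mathbf{Z}) - P^K(\tilde{\mathbf{Z}})\|_F^2 \ge \sum_{1 \le k \ne \ell \le K} \frac{1}{N_k N_\ell} \Big( \sum_{i \in I_k, j \in I_\ell} \langle \mathbf{u}^{(i)}, \mathbf{u}^{(j)} \rangle \Big)^2 = \sum_{1 \le k \ne \ell \le K} N_k N_\ell \cdot \langle \mathbf{c}^{(k)}, \mathbf{c}^{(\ell)} \rangle^2. \]

Finally, combining the last equation and Lemma 4.4 completes the proof.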
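A.4 Review of Principal Angles and Proof of Lemma 4.3

Review of Principal Angles

The principal angles $0 \le \theta_1 \le \cdots \le \theta_K \le \pi/2$ between two K-dimensional
subspaces $\mathcal{S}$ and $\mathcal{T}$ are defined recursively as follows (see e.g., [10]):

\begin{align*}
\cos\theta_1 &= \max_{\mathbf{x} \in \mathcal{S}, \|\mathbf{x}\|_2 = 1} \; \max_{\mathbf{y} \in \mathcal{T}, \|\mathbf{y}\|_2 = 1} \mathbf{x}'\mathbf{y} = \mathbf{x}_1'\mathbf{y}_1, \\
\cos\theta_2 &= \max_{\substack{\mathbf{x} \in \mathcal{S}, \|\mathbf{x}\|_2 = 1 \\ \mathbf{x} \perp \mathbf{x}_1}} \; \max_{\substack{\mathbf{y} \in \mathcal{T}, \|\mathbf{y}\|_2 = 1 \\ \mathbf{y} \perp \mathbf{y}_1}} \mathbf{x}'\mathbf{y} = \mathbf{x}_2'\mathbf{y}_2, \\
&\;\;\vdots \\
\cos\theta_K &= \max_{\substack{\mathbf{x} \in \mathcal{S}, \|\mathbf{x}\|_2 = 1 \\ \mathbf{x} \perp \{\mathbf{x}_1, \ldots, \mathbf{x}_{K-1}\}}} \; \max_{\substack{\mathbf{y} \in \mathcal{T}, \|\mathbf{y}\|_2 = 1 \\ \mathbf{y} \perp \{\mathbf{y}_1, \ldots, \mathbf{y}_{K-1}\}}} \mathbf{x}'\mathbf{y} = \mathbf{x}_K'\mathbf{y}_K.
\end{align*}

Another formula for the cosines of the principal angles is obtained as follows.
Let S and T be two matrices whose columns define orthonormal bases of $\mathcal{S}$ and
$\mathcal{T}$ respectively. Since any $\mathbf{x} \in \mathcal{S}$ and $\mathbf{y} \in \mathcal{T}$ can be represented as $\mathbf{x} = \mathbf{S}\mathbf{u}$ and
$\mathbf{y} = \mathbf{T}\mathbf{v}$ respectively, where u and v are unit vectors in $\mathbb{R}^K$, it follows that

\[ \cos\theta_k = \sigma_k(\mathbf{S}'\mathbf{T}) \quad \text{for } 1 \le k \le K, \]

where $\sigma_k(\mathbf{S}'\mathbf{T})$ denotes the $k$-th largest singular value of $\mathbf{S}'\mathbf{T}$.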
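
The singular-value formula gives a direct way to compute principal angles, and it can be used to check the identity between the projector distance and the sines of the angles that Lemma 4.3 relies on. The following is a minimal sketch with randomly generated subspaces (hypothetical data, fixed seed), using only standard NumPy calls.

```python
import numpy as np

def principal_angles(S, T):
    """Principal angles between span(S) and span(T); both must have orthonormal columns."""
    sigma = np.linalg.svd(S.T @ T, compute_uv=False)   # singular values, descending
    return np.arccos(np.clip(sigma, 0.0, 1.0))         # angles, ascending

rng = np.random.default_rng(0)
S, _ = np.linalg.qr(rng.standard_normal((7, 3)))       # orthonormal basis of a 3-dim subspace
T, _ = np.linalg.qr(rng.standard_normal((7, 3)))
theta = principal_angles(S, T)

# Squared Frobenius distance between the two orthogonal projectors ...
lhs = np.linalg.norm(S @ S.T - T @ T.T, "fro") ** 2
# ... compared with twice the sum of squared sines of the principal angles.
rhs = 2.0 * np.sum(np.sin(theta) ** 2)
```

The two quantities `lhs` and `rhs` agree up to rounding, since $\|\mathbf{S}\mathbf{S}' - \mathbf{T}\mathbf{T}'\|_F^2 = 2K - 2\|\mathbf{S}'\mathbf{T}\|_F^2$.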

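Proof of Lemma 4.3

From the proof of Lemma 4.4 we have that

\begin{align*}
\|P^K(\mathbf{Z}) - P^K(\tilde{\mathbf{Z}})\|_F^2 &= 2K - 2 \cdot \|\tilde{\mathbf{U}}'\mathbf{U}\|_F^2 = 2K - 2 \sum_{k=1}^{K} \sigma_k^2(\tilde{\mathbf{U}}'\mathbf{U}) \\
&= 2K - 2 \sum_{k=1}^{K} \cos^2\theta_k = 2 \sum_{k=1}^{K} \sin^2\theta_k.
\end{align*}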

A.5 Proof of Theorem 4.5

The proof is based on a perturbation result by Zwald and Blanchard [43,
Theorem 3]. In fact, we only need a special case of it which is formulated below.

Theorem A.1 (Matrix version of Theorem 3 in Zwald and Blanchard [43]). Let
S be a symmetric positive square matrix with nonzero eigenvalues $\lambda_1 \ge \cdots \ge \lambda_K > \lambda_{K+1} \ge \cdots > 0$,
where K > 0 is an integer. Define $\delta_K = \lambda_K - \lambda_{K+1} > 0$,
which denotes the $K$-th eigengap of S. Let B be another symmetric matrix such
that $\|\mathbf{B}\|_F < \delta_K/4$ and S + B is still a positive matrix. Then

\[ \|P^K(\mathbf{S} + \mathbf{B}) - P^K(\mathbf{S})\|_F \le 2\|\mathbf{B}\|_F/\delta_K. \qquad (55) \]
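
A small numerical illustration of inequality (55) follows. It is only a sanity check on one hypothetical instance (a diagonal S with eigengap 1.5 and a random symmetric perturbation rescaled to Frobenius norm 0.1), not a verification of the theorem; the helper name is ours.

```python
import numpy as np

def top_k_projector(M, k):
    """Orthogonal projector onto the span of the top-k eigenvectors of a symmetric matrix M."""
    _, vecs = np.linalg.eigh(M)            # eigenvalues returned in ascending order
    U = vecs[:, -k:]
    return U @ U.T

K = 2
S = np.diag([3.0, 2.5, 1.0, 0.5])          # hypothetical positive matrix
delta_K = 2.5 - 1.0                        # its K-th eigengap

rng = np.random.default_rng(1)
G = rng.standard_normal((4, 4))
B = (G + G.T) / 2.0                        # symmetric perturbation
B *= 0.1 / np.linalg.norm(B, "fro")        # rescaled so that ||B||_F = 0.1 < delta_K / 4

lhs = np.linalg.norm(top_k_projector(S + B, K) - top_k_projector(S, K), "fro")
rhs = 2.0 * np.linalg.norm(B, "fro") / delta_K
```

In this instance `lhs` stays below `rhs`, as (55) predicts; the perturbation is small enough that S + B remains positive.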

In order to apply the above theorem to the quantity $\|P^K(\mathbf{Z}) - P^K(\tilde{\mathbf{Z}})\|_F$
we need a lower bound on $\delta_K$, the $K$-th eigengap of $\tilde{\mathbf{Z}}$, and an upper bound on
the Frobenius norm of the difference $\mathbf{B} := \mathbf{Z} - \tilde{\mathbf{Z}}$. While the former bound is
immediate, we find the latter bound somewhat challenging.
First, equation (47), together with $N_1 = \min_{1 \le k \le K} N_k$, implies that:

\[ \delta_K = 1 - \frac{d + 1}{(N_1 - 1)(N_1 - d - 1)}. \qquad (56) \]

Since $N_1 \ge 2(d + 1) + 1$ by equation (12), we then obtain that

\[ \delta_K \ge \frac{2d + 3}{2d + 4} \ge \frac{3}{4}. \qquad (57) \]



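Next, we estimate the Frobenius norm of the perturbation B as follows.
Using the definitions of the matrices Z and W, we rewrite B in the following
way:

\[ \mathbf{B} = \mathbf{D}^{-1/2}\mathbf{A}\mathbf{A}'\mathbf{D}^{-1/2} - \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{A}}'\tilde{\mathbf{D}}^{-1/2}. \]

Regrouping terms gives that

\begin{align*}
\mathbf{B} = {} & \big( \mathbf{D}^{-1/2}\mathbf{A} - \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}} \big) \big( \mathbf{D}^{-1/2}\mathbf{A} - \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}} \big)' \\
& + \big( \mathbf{D}^{-1/2}\mathbf{A} - \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}} \big) \tilde{\mathbf{A}}'\tilde{\mathbf{D}}^{-1/2} + \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}} \big( \mathbf{D}^{-1/2}\mathbf{A} - \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}} \big)'.
\end{align*}

We thus get an initial upper bound on its Frobenius norm:

\[ \|\mathbf{B}\|_F \le \big\| \mathbf{D}^{-1/2}\mathbf{A} - \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}} \big\|_F^2 + 2 \big\| \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}} \big\|_F \cdot \big\| \mathbf{D}^{-1/2}\mathbf{A} - \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}} \big\|_F. \qquad (58) \]

By using equations (45) and (46), we get that

\[ \big\| \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}} \big\|_F^2 = \mathrm{trace}\big( \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{W}}\tilde{\mathbf{D}}^{-1/2} \big) = \mathrm{trace}(\tilde{\mathbf{Z}}) = \sum_{k=1}^{K} \frac{N_k}{N_k - d - 1}. \]

Equation (12) implies that

\[ \frac{N_k}{N_k - d - 1} < 2, \quad 1 \le k \le K. \]

Consequently, we have

\[ \big\| \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}} \big\|_F^2 < 2K, \]

and thus equation (58) becomes

\[ \|\mathbf{B}\|_F \le \big\| \mathbf{D}^{-1/2}\mathbf{A} - \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}} \big\|_F^2 + 2\sqrt{2K} \cdot \big\| \mathbf{D}^{-1/2}\mathbf{A} - \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}} \big\|_F. \qquad (59) \]

Therefore, in order to control $\|\mathbf{B}\|_F$, we only need to bound $\| \mathbf{D}^{-1/2}\mathbf{A} - \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}} \|_F$.
Let

\[ \mathbf{E} := \mathbf{A} - \tilde{\mathbf{A}}. \]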

Replacing A with A + E yields that 

D-V2A - D- V2a||^ = II (d-1/2 _ D-V2) a + D'^ 



D-V2A-D-V2A 



/2E 



< 



(d-V2_d-V2) a||^ + ||d-i/2e 



(60) 
The second term on the right hand side of equation (60) is bounded as follows



D-V2E 



< 



D-i/2 
^£2(^1) 



2 

1/2 



•||E||p< 



(£2D)-V2 



• IIEII 



• IIEI 



F ' 



(61) 



in which the second inequality follows from Assumption 1 (D > > 0), and 
the last equality is due to our convention: A'"i = mini<fc<x iV^ (which implies 
that di = mini<fe<^ dh). 

Bounding the first term of the right hand side of equation (60) requires more 
work. We estimate it as follows: 

(d-V2_d-V2) .a|| = d-V2d-V2(^dV2 + di/2)"' (d-d) -A 

D-1/2 (£26) (b'/^y (d - d) • A 

F 

D-3/2 (^D - d) • A ^ . (62) 



< 



-1/2 



We proceed by using the index sets Ii, . . . , I^f (see equation (9)) to expand 
the last equation: 



(D-V2 _ D-V2) . a||^ < s^^^\i y: e(a.-4)^ 

V l<k<Kielk 



A(i,:) 



E E 



Da — rffc 



\ii^K{^, {Nk-d-l)-dl 
< S2^^^d^\Ni - d - 1)-^/^ ■ D-D 

Using the definitions of D and D, we obtain that 



(63) 



D-D 



W- W • 1 



< 



W- w 



• 111 



N\\2 



ArV2 . 



AE' + EA' + EE' 



< 7VV2 . (2 
and ( 

(following equation (12)) gives that 



F 

I|e|If + I|e|If 



)■ 



(64) 



Combining equations (63) and (64) and applying Ni — d — 1 > ^ > 



(d-V2_d-V2)a||^<(^)^^^-(2 



l|E|L + IIEII 



^) • (65) 



By substituting equations (61) and (65) into equation (60), we arrive at 

/ 2K N 



D-V2A-D-V2A 



< 



(2 



ei£2 



+ e-'/'d-'^'Mv 



IIEIL + IIEII 



(66) 
In order to complete the above estimate for



D-V2A-D-1/2A 



we need to 



estimate 



from above 



and rfi from below 



= (ATi - rf- 1) • P(iVi - 1) > {Ni/2f+^ > H 



2K 



d+2 



We also note that all the elements of the matrix E are between -1 and 1, and
thus

\[ \|\mathbf{E}\|_F \le N^{(d+2)/2}. \]

We then continue from equation (66), together with the last three estimates, 
and get that 



< 



D-V2A-D-V2A 

) 



£1^2 , 



2K 



-(d+2)/2 



lEll 



<4£-VM^j TV-C'^+^V^ ||E||p . 

Finally, it follows from equations (59) and (67) that 

m^<Co{K,d,er,e2)-N-^''+^y^ ||E||p , 

where 



1 



2d+5 



Co{K, d, £i, £2) := 16£2 ^ + 2V2K ■ 4£-'/' J 



2K\ 



d+5/2 



(67) 



(68) 



(69) 



By combining Theorem A.l with equations (57) and (68), we obtain that when 

Co(i^,d,£i,£2) • N-^'^+^y^ ||E|1f < 3/16, 

then 

■ p/f(z) - P^(Z) < 8/3 • Co{K,d,eue2) ■ iV-('^+2)/2 ||E||p . 



Ci{K, d, £1, £2) := 32/9 • C^{K, d, £1, £2), 



Letting 
and noting 

we complete the proof by combining Lemma 4.4 and equations (70) and (71) 



(70) 
(71) 



IeIIf^II^IIf' 
A.6 Proof of Theorem 4.9

The proof proceeds in parallel to that of Theorem 4.5. That is, we bound from
below the $K$-th eigengap $\delta_K$ of $\tilde{\mathbf{W}}$, estimate from above the Frobenius norm of
the perturbation $\mathbf{B} := \mathbf{W} - \tilde{\mathbf{W}}$, and then conclude the theorem by combining
these two bounds with Theorem A.1.

Straightforward calculation shows that the matrix $\tilde{\mathbf{W}}$ (see formula in
Equation (43)) has the following eigenvalues:

\[ d_K \ge \cdots \ge d_2 \ge d_1 \quad \text{and} \quad \nu_K \ge \cdots \ge \nu_2 \ge \nu_1, \]

where $d_k$, $1 \le k \le K$, are defined in equation (26), and

\[ \nu_k := (d + 1) \cdot P(N_k - 2, d), \quad k = 1, \ldots, K. \]

Using equation (12) we obtain that

eiN 

jVfc s JV - - ij • 

fe=i 



K-l 



NK = N-Y,Nk<N-{K-l) 



K \ K 

The above equation together with equations (12) and (28) implies that 

5k = di- VK 

d+2 



(72) 



> 



> 



eiN_ 
'2K 



d+2 



(Ci+l).(l-^£l) iV^ 



£lN 



d+2 



2\2K 

We follow by bounding the magnitude of the perturbation B = W — W: 



(73) 



l|B||p = 



AE' + EA' 



< I|A||fI|E||p + ||E||p 



< 2iv(''+2)/2 ||E||p . (74) 



Therefore, by combining equations (73) and (74) with Theorem A.l we conclude 
that when 

" - 16 \2KJ 

we have 



(W) - (W) 







< 8 


(f) 


F 





d+2 



N' 



-(d+2)/2 I 



El 



F ■ 



Theorem 4.9 is then a direct consequence of combining the above equation and 
Lemma 4.4. 



A.7 Proof of Lemma 4.7

In the T space the centers of the underlying clusters are

\[ \sqrt{N_k} \cdot \mathbf{c}^{(k)}, \quad 1 \le k \le K. \qquad (75) \]
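Applying Lemma 4.2 with K = 2 gives that

\begin{align*}
\big\| \sqrt{N_1}\, \mathbf{c}^{(1)} - \sqrt{N_2}\, \mathbf{c}^{(2)} \big\|_2^2 &= N_1 \cdot \|\mathbf{c}^{(1)}\|_2^2 + N_2 \cdot \|\mathbf{c}^{(2)}\|_2^2 - 2\sqrt{N_1 N_2} \cdot \langle \mathbf{c}^{(1)}, \mathbf{c}^{(2)} \rangle \\
&\ge 2 - \mathrm{TV}(\mathbf{U}) - 2\sqrt{\mathrm{TV}(\mathbf{U})}.
\end{align*}

When

\[ \mathrm{TV}(\mathbf{U}) < (\sqrt{3} - 1)^2, \]

we can let

\[ \tau := \sqrt{2 - \mathrm{TV}(\mathbf{U}) - 2\sqrt{\mathrm{TV}(\mathbf{U})}}. \]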



Then the clustering identification error of TSCC in the T space is bounded as 
follows: 

2 



eid(T)<l-^#{ieIfe 



fc=i 



> r 



For each = 1,2, we apply Chebyshev's inequality and obtain that 



Thus, 



TV — 

fe=i ieiji 



(fe) 



fe=i 



< -2 • TV(U) . 

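In the U space, we also apply Lemma 4.2 with $K = 2$, together with the
assumptions $N_2 \ge N_1 \ge \varepsilon_1 \cdot N/2$, and obtain that

$$\begin{aligned}
\left\| c^{(1)} - c^{(2)} \right\|^2
&\ge \frac{1}{N_2} \left( N_1 \|c^{(1)}\|^2 + N_2 \|c^{(2)}\|^2 \right) - 2 \cdot \langle c^{(1)}, c^{(2)} \rangle \\
&\ge \frac{1}{N_2} \cdot \left(2 - \mathrm{TV}(U)\right) - \frac{2}{\sqrt{N_1 N_2}} \sqrt{\mathrm{TV}(U)} \\
&\ge \frac{1}{N} \cdot \left(2 - \mathrm{TV}(U) - 4/\varepsilon_1 \cdot \sqrt{\mathrm{TV}(U)}\right) .
\end{aligned}$$

When

$$\mathrm{TV}(U) < \left( \sqrt{2 + \frac{4}{\varepsilon_1^2}} - \frac{2}{\varepsilon_1} \right)^2,$$

we can apply similar steps as above to obtain that

$$e_{\mathrm{id}}(U) \le \frac{4\, \mathrm{TV}(U)}{2 - \mathrm{TV}(U) - 4/\varepsilon_1 \cdot \sqrt{\mathrm{TV}(U)}} .$$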
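To get a feel for the size of these bounds, the sketch below evaluates the U-space bound $4\,\mathrm{TV}(U)/(2 - \mathrm{TV}(U) - (4/\varepsilon_1)\sqrt{\mathrm{TV}(U)})$ together with the range of $\mathrm{TV}(U)$ on which its denominator is positive. It also evaluates a T-space bound under the assumption that $\tau^{-2}\cdot\mathrm{TV}(U)$ has the same shape with the coefficient $4/\varepsilon_1$ replaced by $2$, which corresponds to $\tau = \tfrac{1}{2}\sqrt{2 - \mathrm{TV}(U) - 2\sqrt{\mathrm{TV}(U)}}$. All function names are hypothetical.

```python
from math import sqrt


def tv_threshold(coeff):
    """Supremum of TV values for which 2 - tv - coeff * sqrt(tv) is positive."""
    half = coeff / 2.0
    # Positive root s of s^2 + coeff * s - 2 = 0, then tv = s^2.
    return (sqrt(half * half + 2.0) - half) ** 2


def eid_bound(tv, coeff):
    """Evaluate 4 tv / (2 - tv - coeff * sqrt(tv)) on its admissible range."""
    if not 0.0 <= tv < tv_threshold(coeff):
        raise ValueError("tv is outside the range where the bound applies")
    return 4.0 * tv / (2.0 - tv - coeff * sqrt(tv))


def eid_bound_T(tv):
    """Assumed T-space form: coefficient 2 in front of sqrt(tv)."""
    return eid_bound(tv, 2.0)


def eid_bound_U(tv, eps1):
    """U-space form: coefficient 4 / eps1 in front of sqrt(tv)."""
    return eid_bound(tv, 4.0 / eps1)
```

With coefficient $2$ the threshold returned by `tv_threshold` equals $(\sqrt{3}-1)^2$, and since $4/\varepsilon_1 \ge 4 > 2$ the U-space bound is always the weaker of the two for the same value of $\mathrm{TV}(U)$.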
A.8 Proofs of Main Statements of Subsection 5.4

A.8.1 Proof of Lemma 5.3 

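For any $1 \le k \le K$ and $i \in I_k$, we have

$$\begin{aligned}
D_{ii} \ge \sum_{j \in I_k} W_{ij}
&\ge \sum_{j \in I_k} \; \sum_{i_2, \ldots, i_{d+2} \in I_k} \mathcal{A}(i, i_2, \ldots, i_{d+2})\, \mathcal{A}(j, i_2, \ldots, i_{d+2}) \\
&= \sum_{j \in I_k} \; \sum_{\substack{i_2, \ldots, i_{d+2} \in I_k \setminus \{i,j\} \\ \text{and are distinct}}} e^{-\left( c_p(x_i, x_{i_2}, \ldots, x_{i_{d+2}}) + c_p(x_j, x_{i_2}, \ldots, x_{i_{d+2}}) \right)/\sigma} . 
\end{aligned} \tag{76}$$

When the given data is noiseless, the polar curvature of any $d + 2$ distinct
points in the same underlying cluster is zero. Hence,

$$D_{ii} \ge \sum_{j \in I_k} \; \sum_{\substack{i_2, \ldots, i_{d+2} \in I_k \setminus \{i,j\} \\ \text{and are distinct}}} 1 = \tilde{d}_k,$$

where $\tilde{d}_k$ (replicated $N_k$ times), $1 \le k \le K$, are the diagonal elements of $\tilde{D}$ (see
equation (26)). We have thus proved that $D \ge \tilde{D}$.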

A.8.2 Proof of Equation (40) 

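We note the following obvious bound on the polar curvature of any sampled
points $x_{i_1}, \ldots, x_{i_{d+2}}$ from the ball $B$ according to Problem 1:

$$c_p(x_{i_1}, \ldots, x_{i_{d+2}}) \le \operatorname{diam}(B) \cdot \sqrt{d+2} .$$

Combining this bound with equation (76) we obtain that

$$D_{ii} \ge \sum_{j \in I_k} \; \sum_{\substack{i_2, \ldots, i_{d+2} \in I_k \setminus \{i,j\} \\ \text{and are distinct}}} e^{-\left( \sqrt{d+2}\, \operatorname{diam}(B) + \sqrt{d+2}\, \operatorname{diam}(B) \right)/\sigma} = e^{-2\sqrt{d+2}\, \operatorname{diam}(B)/\sigma} \cdot \tilde{d}_k .$$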

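A.8.3 Proof of Lemma 5.4

We take the expectation of each side of equation (76) with respect to the measure
$\mu_p$ (defined in equation (30)), and proceed using Jensen's inequality (twice) as
follows:

$$\begin{aligned}
E_{\mu_p}(D_{ii})
&\ge \sum_{j \in I_k} \; \sum_{\substack{i_2, \ldots, i_{d+2} \in I_k \setminus \{i,j\} \\ \text{and are distinct}}} e^{-\frac{1}{\sigma} E_{\mu_p}\left( c_p(x_i, x_{i_2}, \ldots, x_{i_{d+2}}) + c_p(x_j, x_{i_2}, \ldots, x_{i_{d+2}}) \right)} \\
&\ge \sum_{j \in I_k} \; \sum_{\substack{i_2, \ldots, i_{d+2} \in I_k \setminus \{i,j\} \\ \text{and are distinct}}} e^{-\frac{1}{\sigma} \left( \sqrt{E_{\mu_p} c_p^2(x_i, x_{i_2}, \ldots, x_{i_{d+2}})} + \sqrt{E_{\mu_p} c_p^2(x_j, x_{i_2}, \ldots, x_{i_{d+2}})} \right)} \\
&= e^{-\frac{2}{\sigma} c_p(\mu_k)} \cdot \tilde{d}_k,
\end{aligned}$$

where in the last step we have used equation (4). Letting

$$\varepsilon_2 := \min_{1 \le k \le K} e^{-\frac{2}{\sigma} c_p(\mu_k)} = e^{-\frac{2}{\sigma} \max_{1 \le k \le K} c_p(\mu_k)},$$

we have that

$$E_{\mu_p}(D_{ii}) \ge \varepsilon_2 \cdot \tilde{d}_k, \quad i \in I_k, \; 1 \le k \le K.$$

Equivalently,

$$E_{\mu_p}(D) \ge \varepsilon_2 \cdot \tilde{D} .$$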

A.9 Proof of Theorem 5.1 

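We first bound the expectation of the perturbation $\|\mathcal{E}_p\|_F^2$, where $\mathcal{E}_p = \mathcal{A}_p - \tilde{\mathcal{A}}$,
and then apply McDiarmid's inequality [29] to obtain a probabilistic estimate for
$\|\mathcal{E}_p\|_F^2$. Finally, we conclude the proof by combining the probabilistic estimate
together with Theorem 4.5.

Using the definitions of the sets $I_1, \ldots, I_K$ and the tensors $\mathcal{A}_p$ and $\tilde{\mathcal{A}}$, we
express $\|\mathcal{E}_p\|_F^2$ as a function of the random variables $X_1, \ldots, X_N$:

$$\|\mathcal{E}_p\|_F^2 = \sum_{k=1}^{K} \; \sum_{\substack{i_1, \ldots, i_{d+2} \in I_k \\ \text{distinct}}} \left( 1 - e^{-c_p(X_{i_1}, \ldots, X_{i_{d+2}})/\sigma} \right)^2 + \sum_{\substack{i_1, \ldots, i_{d+2} \text{ distinct,} \\ \text{not all in one } I_k}} e^{-2 c_p(X_{i_1}, \ldots, X_{i_{d+2}})/\sigma} .$$

By applying the inequality $1 - e^{-|x|} \le |x|$, we obtain that

$$\|\mathcal{E}_p\|_F^2 \le \sum_{k=1}^{K} \; \sum_{\substack{i_1, \ldots, i_{d+2} \in I_k \\ \text{distinct}}} \frac{c_p^2(X_{i_1}, \ldots, X_{i_{d+2}})}{\sigma^2} + \sum_{\substack{i_1, \ldots, i_{d+2} \text{ distinct,} \\ \text{not all in one } I_k}} e^{-c_p(X_{i_1}, \ldots, X_{i_{d+2}})/(\sigma/2)} .$$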

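We then take the expectation of $\|\mathcal{E}_p\|_F^2$ (with respect to $\mu_p$) using equations (4)
and (31) and have that

$$\begin{aligned}
E_{\mu_p}\left( \|\mathcal{E}_p\|_F^2 \right)
&\le \frac{1}{\sigma^2} \sum_{k=1}^{K} N_k^{d+2}\, c_p^2(\mu_k) + N^{d+2}\, C_{\mathrm{in}}(\mu_1, \ldots, \mu_K; \sigma/2) \\
&= N^{d+2} \cdot \left( \frac{1}{\sigma^2} \sum_{k=1}^{K} \left( \frac{N_k}{N} \right)^{d+2} c_p^2(\mu_k) + C_{\mathrm{in}}(\mu_1, \ldots, \mu_K; \sigma/2) \right) \\
&\le \alpha \cdot N^{d+2},
\end{aligned} \tag{77}$$

in which

$$\alpha := \frac{1}{\sigma^2} \cdot \sum_{k=1}^{K} c_p^2(\mu_k) + C_{\mathrm{in}}(\mu_1, \ldots, \mu_K; \sigma/2) .$$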



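We next note that for each fixed $1 \le i \le N$,

$$\sup_{X_1, \ldots, X_N,\, X_i'} \left| \|\mathcal{E}_p\|_F^2 (X_1, \ldots, X_i, \ldots, X_N) - \|\mathcal{E}_p\|_F^2 (X_1, \ldots, X_i', \ldots, X_N) \right| \le (d+2) \cdot N^{d+1} .$$

Indeed, the number of additive terms in $\|\mathcal{E}_p\|_F^2 (X_1, \ldots, X_N)$ that contain $X_i$ is
$(d + 2) \cdot P(N - 1, d + 1)$, and each of them is between 0 and 1.

The above property implies that $\|\mathcal{E}_p\|_F^2$ satisfies McDiarmid's inequality [29],
that is,

$$\mu_p \left( \|\mathcal{E}_p\|_F^2 - E_{\mu_p}\left( \|\mathcal{E}_p\|_F^2 \right) \ge \alpha N^{d+2} \right) \le e^{-2N\alpha^2/(d+2)^2} .$$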






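Combining the last equation with equation (77) yields that

$$\mu_p \left( \|\mathcal{E}_p\|_F^2 \ge 2\alpha N^{d+2} \right) \le e^{-2N\alpha^2/(d+2)^2},$$

or equivalently,

$$\mu_p \left( N^{-(d+2)}\, \|\mathcal{E}_p\|_F^2 < 2\alpha \right) \ge 1 - e^{-2N\alpha^2/(d+2)^2} .$$

Consequently, combining Theorem 4.5 and the last equation gives that, if

$$2\alpha \le \frac{1}{8 C_1}, \quad \text{where } C_1 = C_1(K, d, \varepsilon_1, \varepsilon_2) \text{ is defined in equation (71)},$$

then

$$\begin{aligned}
\mu_p \left( \mathrm{TV}(U) \le 2\alpha \cdot C_1 \mid \text{Assumption 1 holds} \right)
&\ge \mu_p \left( \mathrm{TV}(U) \le 2\alpha \cdot C_1 \mid \text{Assumption 1 holds, and } N^{-(d+2)}\, \|\mathcal{E}_p\|_F^2 < 2\alpha \right) \\
&\quad \cdot \mu_p \left( N^{-(d+2)}\, \|\mathcal{E}_p\|_F^2 < 2\alpha \mid \text{Assumption 1 holds} \right) \\
&= 1 \cdot \mu_p \left( N^{-(d+2)}\, \|\mathcal{E}_p\|_F^2 < 2\alpha \right) \\
&\ge 1 - e^{-2N\alpha^2/(d+2)^2} .
\end{aligned}$$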
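The exponent $2N\alpha^2/(d+2)^2$ can be traced back to the general form of McDiarmid's inequality, $\exp(-2t^2/\sum_i c_i^2)$, by inserting the bounded differences $c_i = (d+2)\cdot N^{d+1}$ for each of the $N$ variables and the deviation $t = \alpha N^{d+2}$. The sketch below carries out this substitution numerically. The function names and the sample values of $N$, $d$ and $\alpha$ are assumptions made only for illustration.

```python
from math import exp


def mcdiarmid_tail(t, bounded_differences):
    """General one-sided McDiarmid bound exp(-2 t^2 / sum_i c_i^2)."""
    return exp(-2.0 * t * t / sum(c * c for c in bounded_differences))


def perturbation_tail(n_points, d, alpha):
    """Specialized bound exp(-2 N alpha^2 / (d + 2)^2)."""
    return exp(-2.0 * n_points * alpha ** 2 / (d + 2) ** 2)


def success_probability_lower_bound(n_points, d, alpha):
    """Lower bound 1 - exp(-2 N alpha^2 / (d + 2)^2) on the sampling probability."""
    return 1.0 - perturbation_tail(n_points, d, alpha)
```

Because the exponent grows linearly in $N$ for fixed $\alpha$ and $d$, the lower bound on the sampling probability approaches 1 exponentially fast as the sample size increases.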

A.10 Proof of Equation (36) in Example 5.5

For any three points $p_1(x_1, 0)$, $p_2(x_2, 0) \in \mathrm{L1}$, and $q(0, y) \in \mathrm{L2}$, their polar
curvature is bounded below by

$c_p(p_1, p_2, q) = \operatorname{diam}\{p_1, p_2, q\} \cdot \sqrt{\sin^2 \angle p_1 p_2 q + \sin^2 \angle p_2 p_1 q + \sin^2 \angle p_1 q p_2}$
> max ( y;^, ■ s[^^+^^ 

Thus, by using the symmetry of the lines, we obtain that

$$C_{\mathrm{in},L}(\mu_1, \mu_2; \sigma) = \int_{\mathrm{L1}} \int_{\mathrm{L1}} \int_{\mathrm{L2}} e^{-c_p(p_1, p_2, q)/\sigma}\, d\mu_1(p_1)\, d\mu_1(p_2)\, d\mu_2(q)$$

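A.11 Proof of Equation (37) in Example 5.6

For any two points $p(x, 0) \in \mathrm{L1}$, $q(r\cos\theta, r\sin\theta) \in \mathrm{L2}$, the polar curvature of
$p$, $q$ and the origin $o$ is bounded below by

$$c_p(o, p, q) = \operatorname{diam}\{o, p, q\} \cdot \sqrt{\sin^2\theta + \sin^2 \angle opq + \sin^2 \angle oqp} \ge \max(x, r) \cdot \sin\theta .$$

Thus, the incidence constant is bounded above by

$$\begin{aligned}
C_{\mathrm{in},L}(\mu_1, \mu_2; \sigma)
&= \int_{\mathrm{L1}} \int_{\mathrm{L2}} e^{-c_p(o, p, q)/\sigma}\, d\mu_1(p)\, d\mu_2(q) \\
&\le \int_0^L \int_0^L e^{-\max(x, r) \sin\theta/\sigma}\, \frac{dx}{L}\, \frac{dr}{L} \\
&= \frac{2}{L^2} \iint_{0 \le x \le r \le L} e^{-r \sin\theta/\sigma}\, dx\, dr \\
&= \frac{2}{L^2} \int_0^L r\, e^{-r \sin\theta/\sigma}\, dr \\
&= 2 \left( \frac{\sigma}{L \sin\theta} \right)^2 \left( 1 - e^{-L\sin\theta/\sigma} \left( 1 + \frac{L \sin\theta}{\sigma} \right) \right) .
\end{aligned}$$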

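A.12 Proof of Equation (38) in Example 5.7

For any $p(x, y_2) \in \mathrm{R1}$, $q(x_1, y) \in \mathrm{R2}$, we define $\bar{p}(x, \epsilon) \in \mathrm{R1}$, $\bar{q}(\epsilon, y) \in \mathrm{R2}$. The
polar curvature of $p$, $q$ and the origin $o$ is bounded below by

$$c_p(o, p, q) \ge \max(\|op\|, \|oq\|) \cdot \sin \angle poq \ge \max(x, y) \cdot \sin \angle \bar{p} o \bar{q} = \frac{\max(x, y) \cdot (xy - \epsilon^2)}{\sqrt{(x^2 + \epsilon^2)(y^2 + \epsilon^2)}} .$$

Thus, the incidence constant is

$$C_{\mathrm{in},L}(\mu_1, \mu_2; \sigma) = \int_{\mathrm{R1}} \int_{\mathrm{R2}} e^{-c_p(o, p, q)/\sigma}\, d\mu_1(p)\, d\mu_2(q) \le \frac{1}{L^2} \int_{\epsilon}^{L+\epsilon} \int_{\epsilon}^{L+\epsilon} e^{-\frac{\max(x, y) \cdot (xy - \epsilon^2)}{\sigma \sqrt{(x^2 + \epsilon^2)(y^2 + \epsilon^2)}}}\, dx\, dy .$$

Changing variables $x := x/\epsilon$, $y := y/\epsilon$ and setting $\omega := L/\epsilon$ gives that

$$C_{\mathrm{in},L}(\mu_1, \mu_2; \sigma) \le \frac{1}{\omega^2} \int_{1}^{1+\omega} \int_{1}^{1+\omega} e^{-\frac{\epsilon}{\sigma} \cdot \frac{\max(x, y) \cdot (xy - 1)}{\sqrt{(x^2 + 1)(y^2 + 1)}}}\, dx\, dy .$$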

We observe that the integrand is bounded between 0 and 1, symmetric about
$x$ and $y$, and decreasing in each of its arguments. We thus obtain that



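$$\begin{aligned}
C_{\mathrm{in},L}(\mu_1, \mu_2; \sigma)
&\le \frac{1}{\omega^2} \cdot \left( \int_{1}^{1+\eta} \!\! \int_{1}^{1+\eta} + \; 2 \int_{1+\eta}^{1+\omega} \!\! \int_{1}^{1+\eta} + \int_{1+\eta}^{1+\omega} \!\! \int_{1+\eta}^{1+\omega} \right) e^{-\frac{\epsilon}{\sigma} \cdot \frac{\max(x, y) \cdot (xy - 1)}{\sqrt{(x^2 + 1)(y^2 + 1)}}}\, dx\, dy \\
&\le \frac{1}{\omega^2} \left( \eta^2 + 2 \cdot \eta \cdot (\omega - \eta) \cdot e^{-\frac{\epsilon}{\sigma} \cdot \frac{(1+\eta) \cdot (1 \cdot (1+\eta) - 1)}{\sqrt{2\left((1+\eta)^2 + 1\right)}}} + (\omega - \eta)^2 \cdot e^{-\frac{\epsilon}{\sigma} \cdot \frac{(1+\eta) \cdot \left((1+\eta)^2 - 1\right)}{(1+\eta)^2 + 1}} \right) .
\end{aligned}$$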
A.13 Proof of Equation (39) in Example 5.8

Let $p(0, \rho\cos\varphi, \rho\sin\varphi) \in \mathrm{D1}$, and $q_1(0, r_1\cos\theta_1, r_1\sin\theta_1)$, $q_2(0, r_2\cos\theta_2, r_2\sin\theta_2) \in \mathrm{D2}$.
Then the polar curvature of these three points and the origin $o$ has the
following lower bound:

$$c_p(o, p, q_1, q_2) \ge |op| \cdot \operatorname{psin}_o(p, q_1, q_2) = \rho \cdot \sin\varphi\, \sin|\theta_1 - \theta_2| .$$

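Due to the symmetry of the two disks, we have that

$$C_{\mathrm{in},L}(\mu_1, \mu_2; \sigma) = \int_{\mathrm{D1}} \int_{\mathrm{D2}} \int_{\mathrm{D2}} e^{-c_p(o, p, q_1, q_2)/\sigma}\, d\mu_1(p)\, d\mu_2(q_1)\, d\mu_2(q_2)$$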

< /■■^/^ ^^^72 psinip.sin|9i-e2l p dp Aip AO I c\02 

~ Jo Jo 7-^/2 7-71/2 7r/2 TT TT 

Jo Jo JJ-f<e2<ei<f 

Changing variables $\theta := \theta_1 - \theta_2$, $\theta_2 := \theta_2$ and exchanging the corresponding
double integral, we obtain that

4 /""^ p sin ysin 9 

Cin,i.{lJ.i,l^2;<y) < — ■ e " pdpd(p[Tr - 9)d9 

^ Jo Jo Jo 



JO J 

1 flV f-TV 



4: j j I P Sin ysin . 

< — • I I I e ' pdpd(pd6 



JO Jo 



= -^ff f ^ pdpdipd6. 

Jo Jo Jo 



We observe that the integrand is bounded between 0 and 1, symmetric about
$\varphi$ and $\theta$, and decreasing in each of them. Thus,



16 



an,L(Ml,At2;a) < ^ • / / / +2 



^ J ^J ^ 



g ^ ^ pdpd(pd9 

< 



i|.((y?)%2.^.(|-y5)).^'pdp 

16 /TT ,^\2 P (sin 



+ 2 



— O l~ 



TT (sin^)4" 